[DISCUSS] Unify flink configuration #1857

Huajie Wang Thu, 20 Oct 2022 01:48:31 -0700

Hello everyone, this discussion is about developing a unified
specification for Flink job configuration files in StreamPark. You are
welcome to join the discussion.





*Background:*
StreamPark is positioned as a rapid development framework for engines such
as Flink and Spark. An important part of it is standardized configuration:
all settings that would otherwise be hardcoded are moved into a
configuration file. When the project starts, you only need to pass in the
agreed configuration file, which completes the initialization of the
environment and the setting of parameters.

The parameter specification in the current version is somewhat
unreasonable: it redefines the format of the parameters, which differs
slightly from the official Flink configuration. A PR has already done
related work on this part [1]. Its approach is to put Flink's env parameter
settings under `property`, where each key is the standard Flink parameter
key. However, that work only regulates the parameter configuration under
`property` and does not regulate the global parameter settings.


The current configuration rules are as follows:

flink:
  deployment:
    option:
        ...
    property:
        ...


For example, suppose a Flink job is deployed in yarn-per-job mode, the job
name is test-job, the parallelism is 2, and the main class is
org.apache.streampark.FlinkJob. The configuration is then as follows:

flink:
  deployment:
    option:
        target: yarn-per-job
    property:
        $internal.application.main: org.apache.streampark.FlinkJob
        pipeline.name: test-job
        taskmanager.numberOfTaskSlots: 1
        parallelism.default: 2


As we can see, the root prefix is `flink`. `option` defines the parameters
related to deploying the job, and `property` defines the Flink parameter
configuration. The configurable parameters are fully consistent with the
standard Flink parameters [2]. This design specification has the following
deficiencies:

1. The format of table-related parameter settings is not defined.
2. The user's business parameters are not defined.
3. The content of Flink SQL is not defined.

Therefore, the purpose of this discussion is to solve these problems and
further standardize the parameters. Because this part of the specification
is important and directly affects users who develop with the StreamPark
API, it deserves in-depth communication and discussion.



*Proposal:*
The improved format I initially proposed [3] divides the parameters into
three parts: env, sql, and app. "env" defines the deployment parameters,
the environment-setting parameters, and the table parameters. "sql" defines
the Flink SQL content. "app" defines the user-defined parameters.

env:
  option: # CLI option args
    target: yarn-application # yarn-application or yarn-per-job
    shutdownOnAttachedExit:
    jobmanager:
    ...
  property:
    ${StreamExecutionEnvironment.key} : $value
    ...
    table:
      ${TableEnvironment.key} : $value
      ...
sql: # flinksql
   my_flinksql: |
    CREATE TABLE datagen (
      f_sequence INT,
      ts AS localtimestamp,
      WATERMARK FOR ts AS ts
    ) WITH (
      ....
    );
    ...

app:
    kafka.bootstrap:
    kafka.topic: test
    ...
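
As a rough illustration of the proposed layout, the sketch below splits
an already-parsed configuration into the parts described above. The
helper names `flatten` and `split_config` are hypothetical, the sample
values are illustrative, and keeping the full `table.` prefix on keys
inside the `table` block is an assumption the proposal leaves open.

```python
def flatten(section, prefix=""):
    """Flatten nested dicts into dotted keys: {"a": {"b": 1}} -> {"a.b": 1}."""
    flat = {}
    for key, value in (section or {}).items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, dotted + "."))
        else:
            flat[dotted] = value
    return flat


def split_config(config):
    """Split a parsed config into option, property, table, sql and app parts."""
    env = config.get("env") or {}
    prop = dict(env.get("property") or {})  # shallow copy, input is not mutated
    table = prop.pop("table", None) or {}   # table settings nest under property
    return {
        "option": flatten(env.get("option")),
        "property": flatten(prop),
        "table": flatten(table),
        "sql": dict(config.get("sql") or {}),
        "app": flatten(config.get("app")),
    }


# Illustrative input in the proposed format (values are assumptions).
proposed = {
    "env": {
        "option": {"target": "yarn-application"},
        "property": {
            "pipeline.name": "test-job",
            "parallelism.default": 2,
            # Assumption: keys under `table` keep their full Flink name.
            "table": {"table.exec.state.ttl": "1 h"},
        },
    },
    "sql": {"my_flinksql": "CREATE TABLE datagen (...)"},
    "app": {"kafka.topic": "test"},
}

parts = split_config(proposed)
for name, values in parts.items():
    print(name, values)
```

Splitting this way keeps the three deficiencies listed earlier separate:
table settings, business parameters, and SQL text each get their own
section instead of being mixed into `property`.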


Looking forward to your opinion.



[1] : https://github.com/apache/incubator-streampark/issues/1762
[2] : https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/config/
[3] : https://github.com/apache/incubator-streampark/issues/1857

Best,
Huajie Wang


