Hi, Jark, Timo. Nice to have an agreement!
Thanks for Jark's inputs about the multiple version Flink. I have already
updated the FLIP in the rejected alternatives about details.
1. We should definitely just use LogicalTypeJsonSerializer and not a
second JSON representation.
Our concern is mainly that it's hard for users to use because of the
flexible structure. The LogicalTypeJsonSerializer will serialize the
VARCHAR to "VARCHAR(<LENGTH>)" or "{\"TYPE\": \"VARCHAR\", \"LENGTH\": 0}",
which requires the end users to process the different situations. But in
some cases, users just print the json to the terminal/web UI. WDYT?
Serialize the RowData
Sure. I will keep your advice in mind. I think the current serialization
of the RowData will not use the column name as the Object key in the json.
I am not sure whether I missed something. It would be nice if you can give
me an example if I do something wrong.
Have you also thought about using Flink's state types from Flink
tasks/jobs?
Yes. But I still think we should use a new state machine. First of all,
Operation in the FLIP is much different from the Job. Operations include
DDL, DML and so on. So it's not suitable to use the small concept to
replace the big concept. Actually some status in the JobStatus, e.g.
RESTARTING/SUSPENDED/RECONCILING don't work in the DDL Operation.
On the other hand, the Gateway allows users to submit jobs(DML) in
sync/async mode. The running status in the Operation Status in the
different mode has different meaning:
- In the async mode, when the gateway submits the job, the state comes to
the FINISHED state
- In the sync mode, the running status in the Operation status includes
submitting the job, running job. Even if a failover occurs, we still think
that this Operation is in the RUNNING state. Unless the job is
unrecoverable, we change the Operation status to ERROR.
Therefore, I think these two concepts are not consistent and we should not
reuse the JobStatus. I add a section in the rejected alternatives.
Options to configure the REST endpoint
Yes. I have modified the FLIP about this.
Naming conversion
Yes. I have modified the FLIP with your suggestions.
Another smaller shortcomings in the FLIP
SQLGatewayService.getFunction / UserDefinedFunctionInfo
After reviewing the java.sql.DatabaseMetaData#getFunctions's java doc, I
find it will return the system and user functions available in the Catalog.
I think you are right. Therefore, we'd better to rename to the
listFunctions(SessionHandle sessionHandle, OperationHandle operationHandle,
String catalog, String database, ShowFunctionsOperation.FunctionScope) and
it returns FunctionInfo.
SQLGatewayService.getGatewayInfo()/getSessionConfig
The result of the SQLGatewayService.getGatewayInfo and getSessionConfig is
not used by the endpoint. The endpoint just serializes the result and
presents it to the users. If we use the ReadableConfig, it's hard for us to
iterate all the key value pairs.
configure_session VS initialize_session
If calling it initialize_session, should we limit it only being called
once?
If we limit it only being called once, it allows the input of the
initialize_session script. But the current design in the Gateway is aligned
with the TableEnvironment#executeSql. That is, the input of the statement
is a single statement rather than the script. Considering the API in the
FLIP is not as same as the initialization in the CLI, I think we can use
the configure_session? What do you think, Timo?
Best,
Shengkai
Timo Walther <twal...@apache.org> 于2022年5月16日周一 14:28写道:
Hi Shengkai, Hi Jark,
thanks for the additional explanation and the update of the FLIP. This
will help us in the future for documenting our decisions. The arguments
why to include the Gateway into the main repo make a lot of sense to me.
Esp. also because both CLI and gateway need some parsing functionality
that is dependent on the current state of the SQL syntax.
Here is my last set of feedback, other than that +1 for the proposal:
Serialize the LogicalType
The FLIP mentions LogicalTypeJsonSerializer but the shown JSON is
different from the current master. We are using the serializable
representation of LogicalType as much as possible nowadays. We should
definitely just use LogicalTypeJsonSerializer and not a second JSON
representation.
1) Serialize the RowData
Side note for serializing ROWs: we should not use field names in JSON
object keys. As e.g. `null` and other names with special characters
cause issues in JSON.
2) We propose the state machine like HiveServer2
Have you also thought about using Flink's state types from Flink
tasks/jobs? If we were using Flink types directly, it would be easier to
monitor the execution of a INSERT INTO job via the gateway without
having to map state types. Monitoring jobs is the most important
functionality and should be in sync with regular Flink job monitoring. A
HiveServer2 endpoint can still perform mapping if necessary, but within
the Flink code base we use a consistent state transition scheme.
3) Options to configure the REST endpoint
Given that we also want to be able to put endpoint options into
flink-conf.yaml, we should make those option a bit more globally
understandable:
endpoint.type -> sql-gateway.endpoint.type
endpoint.rest.port -> sql-gateway.endpoint.rest.port
...
4) Naming conversion
This is very nit picking, but please make sure to name the interfaces
consistently for abbreviations. Currently, we mostly use `SqlInterface`
instead of `SQLInterface`. So in the FLIP:
SQLGatewayEndpoint -> SqlGatewayEndpoint
SQLGatewayEndpointFactory -> SqlGatewayEndpointFactory
e.g. for REST we do it already: RestEndpointVersion
5) Another smaller shortcomings in the FLIP
SQLGatewayService.getFunction / UserDefinedFunctionInfo
Was it a conscious decision to not use FunctionIdentifier here? You will
not be able to list system functions.
SQLGatewayService.getGatewayInfo()/getSessionConfig
Why not returning ReadableConfig here?
Thanks,
Timo
On 12.05.22 17:02, Jark Wu wrote:
Hi Shengkai,
One comment on the configure_session VS initialize_session,
I think configure_session is better than initialize_session,
because "initialize" sounds like this method can only be invoked once.
But I can see this method is useful to be invoked several times to
configure
the session during the lifecycle of a session. If calling it
initialize_session,
should we limit it only be called once?
Best,
Jark
On Thu, 12 May 2022 at 22:38, Jark Wu <imj...@gmail.com
<mailto:imj...@gmail.com>> wrote:
Hi Timo,
I agree with Shengkai that we should keep Gateway in the Flink main
repo.
Here are my thoughts:
1) It is not feasible and maintainable to invoke different versions
of Flink dependencies
in one JVM. I'm afraid it would be a nightmare when the code is
messed up with all the reflections.
2) Actually, it's very mature to support submitting multi-version by
deploying one
gateway process for one version. Many systems are following this
architecture as Shengkai
mentioned above, including the Ververica Platform. And I don't see
any problem with it.
3) As Jingsong said, SQL CLI should build on top of Gateway to share
code and have the ability to
connect to Gateway. If SQL CLI is still kept in the Flink repo, we
have to put Gateway there as well.
4) Besides, Gateway is indispensable for a SQL engine (think of
Trino/Presto, Spark, Hive).
Otherwise, Flink will always be a processing system. With Gateway
inside the Flink repo,
we can provide an out-of-box experience as a SQL query engine. Users
can try out the gateway
for the latest version when a new version is released.
5) Last, Gateway inside the Flink repo can ensure the highest degree
of version compatibility.
Best,
Jark
On Thu, 12 May 2022 at 19:19, Shengkai Fang <fskm...@gmail.com
<mailto:fskm...@gmail.com>> wrote:
Hi Timo.
Thanks for your feedback!
> It would be great if you can copy/paste some of the tricky
design
decisions into the FLIP.
Yes. I have already added a section about the `Rejected
Alternatives`. It
includes the discuss about the
- TableInfo and FunctionInfo VS CatalogTable and CatalogFunction
- Support the multi-version Flink in the Gateway VS Support the
multi-version in the external Service
- Merge Gateway into the Flink code base VS Support Gateway in
the another
repo
...
> Separate code base
Let me summarize all the discussion about the separate code
base. Please
correct me if anything is wrong.
About support for the Gateway in the Flink code base.
1. It reduces the cost including tests, release and document. We
don't need
to upgrade the Gateway when Flink releases.
2. The SQL Client has the ability to submit the SQL to the
Gateway. If we
move the Gateway to the outside repo, the SQL Client doesn't
have the
related dependencies.
3. It is benefit for us to reuse code in the same repo.
About supporting the Gateway in another repo.
1. It is easier to support the multi-version Flink;
First of all, I think we should support the multiple Flink
version with
another component that uses the Gateway from the individual
Flink releases.
Zeppelin, Livy, and Kyuubi all use similar architecture.
The current Gateway in the FLIP is bound to the specific Flink
version,
which needs to compile the SQL and submit the job to the cluster
with the
specified Flink version.
We can introduce a service in the future(Maybe we can name it
MultipleVersionFlinkService? ). It may tell the client where the
Gateway
with specified version is and the client just communicate with
the Gateway
later. It may also just parse the request, convert the request
to the REST
API in the FLIP and send to the Gateway with specified version.
Therefore, I think we should merge the SQL Gateway inside the
Flink repo.
The MultipleVersionFlinkService may be inside or outside the
repo.
I add more details in the `Rejected Alternatives` in the FLIP.
> Missing REST API spec
Good point! I add a section in the `Appendix` about the
serialization of
the ResultSet. It includes how to serialize the schema, RowData
and
exceptions.
> Consistency of concepts
configure_session VS initialize_session
Okay. I think it's better to use the initialize_session.
TableInfo + UserDefinedFunctionInfo VS CatalogTable and
CatalogFunction
The CatalogTable and CatalogFunction are much heavier than the
TableInfo
and UserDefinedFunction. The CatalogManager requires reading
from the
Catalog to get the schema. But in the listTables only care about
the table
name, which is much lighter. Therefore, we propose to use the
TableInfo
with required fields.
> Result Retrieval
Yes. Currently the API is only used for the preview. I added a
section in
the `Rejected Alternatives` about this.
Best,
Shengkai