Re: [DISCUSS] FLIP-84 Feedback Summary

godfrey he Mon, 06 Apr 2020 07:39:27 -0700

Hi Timo,

Sorry for the late reply, and thanks for your correction.
I missed DQL for job submission scenario.
I'll fix the document right away.


Best,
Godfrey

Timo Walther <[email protected]> 于2020年4月3日周五 下午9:53写道：

> Hi Godfrey,
>
> I'm sorry to jump in again but I still need to clarify some things
> around TableResult.
>
> The FLIP says:
> "For DML, this method returns TableResult until the job is submitted.
> For other statements, TableResult is returned until the execution is
> finished."
>
> I thought we agreed on making every execution async? This also means
> returning a TableResult for DQLs even though the execution is not done
> yet. People need access to the JobClient also for batch jobs in order to
> cancel long lasting queries. If people want to wait for the completion
> they can hook into JobClient or collect().
>
> Can we rephrase this part to:
>
> The FLIP says:
> "For DML and DQL, this method returns TableResult once the job has been
> submitted. For DDL and DCL statements, TableResult is returned once the
> operation has finished."
>
> Regards,
> Timo
>
>
> On 02.04.20 05:27, godfrey he wrote:
> > Hi Aljoscha, Dawid, Timo,
> >
> > Thanks so much for the detailed explanation.
> > Agree with you that the multiline story is not completed now, and we can
> > keep discussion.
> > I will add current discussions and conclusions to the FLIP.
> >
> > Best,
> > Godfrey
> >
> >
> >
> > Timo Walther <[email protected]> 于2020年4月1日周三 下午11:27写道：
> >
> >> Hi Godfrey,
> >>
> >> first of all, I agree with Dawid. The multiline story is not completed
> >> by this FLIP. It just verifies the big picture.
> >>
> >> 1. "control the execution logic through the proposed method if they know
> >> what the statements are"
> >>
> >> This is a good point that also Fabian raised in the linked google doc. I
> >> could also imagine to return a more complicated POJO when calling
> >> `executeMultiSql()`.
> >>
> >> The POJO would include some `getSqlProperties()` such that a platform
> >> gets insights into the query before executing. We could also trigger the
> >> execution more explicitly instead of hiding it behind an iterator.
> >>
> >> 2. "there are some special commands introduced in SQL client"
> >>
> >> For platforms and SQL Client specific commands, we could offer a hook to
> >> the parser or a fallback parser in case the regular table environment
> >> parser cannot deal with the statement.
> >>
> >> However, all of that is future work and can be discussed in a separate
> >> FLIP.
> >>
> >> 3. +1 for the `Iterator` instead of `Iterable`.
> >>
> >> 4. "we should convert the checked exception to unchecked exception"
> >>
> >> Yes, I meant using a runtime exception instead of a checked exception.
> >> There was no consensus on putting the exception into the `TableResult`.
> >>
> >> Regards,
> >> Timo
> >>
> >> On 01.04.20 15:35, Dawid Wysakowicz wrote:
> >>> When considering the multi-line support I think it is helpful to start
> >>> with a use case in mind. In my opinion consumers of this method will
> be:
> >>>
> >>>   1. sql-client
> >>>   2. third-part sql based platforms
> >>>
> >>> @Godfrey As for the quit/source/... commands. I think those belong to
> >>> the responsibility of aforementioned. I think they should not be
> >>> understandable by the TableEnvironment. What would quit on a
> >>> TableEnvironment do? Moreover I think such commands should be prefixed
> >>> appropriately. I think it's a common practice to e.g. prefix those with
> >>> ! or : to say they are meta commands of the tool rather than a query.
> >>>
> >>> I also don't necessarily understand why platform users need to know the
> >>> kind of the query to use the proposed method. They should get the type
> >>> from the TableResult#ResultKind. If the ResultKind is SUCCESS, it was a
> >>> DCL/DDL. If SUCCESS_WITH_CONTENT it was a DML/DQL. If that's not enough
> >>> we can enrich the TableResult with more explicit kind of query, but so
> >>> far I don't see such a need.
> >>>
> >>> @Kurt In those cases I would assume the developers want to present
> >>> results of the queries anyway. Moreover I think it is safe to assume
> >>> they can adhere to such a contract that the results must be iterated.
> >>>
> >>> For direct users of TableEnvironment/Table API this method does not
> make
> >>> much sense anyway, in my opinion. I think we can rather safely assume
> in
> >>> this scenario they do not want to submit multiple queries at a single
> >> time.
> >>>
> >>> Best,
> >>>
> >>> Dawid
> >>>
> >>>
> >>> On 01/04/2020 15:07, Kurt Young wrote:
> >>>> One comment to `executeMultilineSql`, I'm afraid sometimes user might
> >>>> forget to
> >>>> iterate the returned iterators, e.g. user submits a bunch of DDLs and
> >>>> expect the
> >>>> framework will execute them one by one. But it didn't.
> >>>>
> >>>> Best,
> >>>> Kurt
> >>>>
> >>>>
> >>>> On Wed, Apr 1, 2020 at 5:10 PM Aljoscha Krettek<[email protected]>
> >> wrote:
> >>>>
> >>>>> Agreed to what Dawid and Timo said.
> >>>>>
> >>>>> To answer your question about multi line SQL: no, we don't think we
> >> need
> >>>>> this in Flink 1.11, we only wanted to make sure that the interfaces
> >> that
> >>>>> we now put in place will potentially allow this in the future.
> >>>>>
> >>>>> Best,
> >>>>> Aljoscha
> >>>>>
> >>>>> On 01.04.20 09:31, godfrey he wrote:
> >>>>>> Hi, Timo & Dawid,
> >>>>>>
> >>>>>> Thanks so much for the effort of `multiline statements supporting`,
> >>>>>> I have a few questions about this method:
> >>>>>>
> >>>>>> 1. users can well control the execution logic through the proposed
> >> method
> >>>>>>     if they know what the statements are (a statement is a DDL, a
> DML
> >> or
> >>>>>> others).
> >>>>>> but if a statement is from a file, that means users do not know what
> >> the
> >>>>>> statements are,
> >>>>>> the execution behavior is unclear.
> >>>>>> As a platform user, I think this method is hard to use, unless the
> >>>>> platform
> >>>>>> defines
> >>>>>> a set of rule about the statements order, such as: no select in the
> >>>>> middle,
> >>>>>> dml must be at tail of sql file (which may be the most case in
> product
> >>>>>> env).
> >>>>>> Otherwise the platform must parse the sql first, then know what the
> >>>>>> statements are.
> >>>>>> If do like that, the platform can handle all cases through
> >> `executeSql`
> >>>>> and
> >>>>>> `StatementSet`.
> >>>>>>
> >>>>>> 2. SQL client can't also use `executeMultilineSql` to supports
> >> multiline
> >>>>>> statements,
> >>>>>>     because there are some special commands introduced in SQL
> client,
> >>>>>> such as `quit`, `source`, `load jar` (not exist now, but maybe we
> need
> >>>>> this
> >>>>>> command
> >>>>>>     to support dynamic table source and udf).
> >>>>>> Does TableEnvironment also supports those commands?
> >>>>>>
> >>>>>> 3. btw, we must have this feature in release-1.11? I find there are
> >> few
> >>>>>> user cases
> >>>>>>     in the feedback document which behavior is unclear now.
> >>>>>>
> >>>>>> regarding to "change the return value from `Iterable<Row` to
> >>>>>> `Iterator<Row`",
> >>>>>> I couldn't agree more with this change. Just as Dawid mentioned
> >>>>>> "The contract of the Iterable#iterator is that it returns a new
> >> iterator
> >>>>>> each time,
> >>>>>>     which effectively means we can iterate the results multiple
> >> times.",
> >>>>>> we does not provide iterate the results multiple times.
> >>>>>> If we want do that, the client must buffer all results. but it's
> >>>>> impossible
> >>>>>> for streaming job.
> >>>>>>
> >>>>>> Best,
> >>>>>> Godfrey
> >>>>>>
> >>>>>> Dawid Wysakowicz<[email protected]>  于2020年4月1日周三 上午3:14写道：
> >>>>>>
> >>>>>>> Thank you Timo for the great summary! It covers (almost) all the
> >> topics.
> >>>>>>> Even though in the end we are not suggesting much changes to the
> >> current
> >>>>>>> state of FLIP I think it is important to lay out all possible use
> >> cases
> >>>>>>> so that we do not change the execution model every release.
> >>>>>>>
> >>>>>>> There is one additional thing we discussed. Could we change the
> >> result
> >>>>>>> type of TableResult#collect to Iterator<Row>? Even though those
> >>>>>>> interfaces do not differ much. I think Iterator better describes
> that
> >>>>>>> the results might not be materialized on the client side, but can
> be
> >>>>>>> retrieved on a per record basis. The contract of the
> >> Iterable#iterator
> >>>>>>> is that it returns a new iterator each time, which effectively
> means
> >> we
> >>>>>>> can iterate the results multiple times. Iterating the results is
> not
> >>>>>>> possible when we don't retrieve all the results from the cluster at
> >>>>> once.
> >>>>>>> I think we should also use Iterator for
> >>>>>>> TableEnvironment#executeMultilineSql(String statements):
> >>>>>>> Iterator<TableResult>.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>>
> >>>>>>> Dawid
> >>>>>>>
> >>>>>>> On 31/03/2020 19:27, Timo Walther wrote:
> >>>>>>>> Hi Godfrey,
> >>>>>>>>
> >>>>>>>> Aljoscha, Dawid, Klou, and I had another discussion around
> FLIP-84.
> >> In
> >>>>>>>> particular, we discussed how the current status of the FLIP and
> the
> >>>>>>>> future requirements around multiline statements, async/sync,
> >> collect()
> >>>>>>>> fit together.
> >>>>>>>>
> >>>>>>>> We also updated the FLIP-84 Feedback Summary document [1] with
> some
> >>>>>>>> use cases.
> >>>>>>>>
> >>>>>>>> We believe that we found a good solution that also fits to what is
> >> in
> >>>>>>>> the current FLIP. So no bigger changes necessary, which is great!
> >>>>>>>>
> >>>>>>>> Our findings were:
> >>>>>>>>
> >>>>>>>> 1. Async vs sync submission of Flink jobs:
> >>>>>>>>
> >>>>>>>> Having a blocking `execute()` in DataStream API was rather a
> >> mistake.
> >>>>>>>> Instead all submissions should be async because this allows
> >> supporting
> >>>>>>>> both modes if necessary. Thus, submitting all queries async sounds
> >>>>>>>> good to us. If users want to run a job sync, they can use the
> >>>>>>>> JobClient and wait for completion (or collect() in case of batch
> >> jobs).
> >>>>>>>>
> >>>>>>>> 2. Multi-statement execution:
> >>>>>>>>
> >>>>>>>> For the multi-statement execution, we don't see a contradication
> >> with
> >>>>>>>> the async execution behavior. We imagine a method like:
> >>>>>>>>
> >>>>>>>> TableEnvironment#executeMultilineSql(String statements):
> >>>>>>>> Iterable<TableResult>
> >>>>>>>>
> >>>>>>>> Where the `Iterator#next()` method would trigger the next
> statement
> >>>>>>>> submission. This allows a caller to decide synchronously when to
> >>>>>>>> submit statements async to the cluster. Thus, a service such as
> the
> >>>>>>>> SQL Client can handle the result of each statement individually
> and
> >>>>>>>> process statement by statement sequentially.
> >>>>>>>>
> >>>>>>>> 3. The role of TableResult and result retrieval in general
> >>>>>>>>
> >>>>>>>> `TableResult` is similar to `JobClient`. Instead of returning a
> >>>>>>>> `CompletableFuture` of something, it is a concrete util class
> where
> >>>>>>>> some methods have the behavior of completable future (e.g.
> >> collect(),
> >>>>>>>> print()) and some are already completed (getTableSchema(),
> >>>>>>>> getResultKind()).
> >>>>>>>>
> >>>>>>>> `StatementSet#execute()` returns a single `TableResult` because
> the
> >>>>>>>> order is undefined in a set and all statements have the same
> schema.
> >>>>>>>> Its `collect()` will return a row for each executed `INSERT INTO`
> in
> >>>>>>>> the order of statement definition.
> >>>>>>>>
> >>>>>>>> For simple `SELECT * FROM ...`, the query execution might block
> >> until
> >>>>>>>> `collect()` is called to pull buffered rows from the job (from
> >>>>>>>> socket/REST API what ever we will use in the future). We can say
> >> that
> >>>>>>>> a statement finished successfully, when the
> >> `collect#Iterator#hasNext`
> >>>>>>>> has returned false.
> >>>>>>>>
> >>>>>>>> I hope this summarizes our discussion @Dawid/Aljoscha/Klou?
> >>>>>>>>
> >>>>>>>> It would be great if we can add these findings to the FLIP before
> we
> >>>>>>>> start voting.
> >>>>>>>>
> >>>>>>>> One minor thing: some `execute()` methods still throw a checked
> >>>>>>>> exception; can we remove that from the FLIP? Also the above
> >> mentioned
> >>>>>>>> `Iterator#next()` would trigger an execution without throwing a
> >>>>>>>> checked exception.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Timo
> >>>>>>>>
> >>>>>>>> [1]
> >>>>>>>>
> >>>>>
> >>
> https://docs.google.com/document/d/1ueLjQWRPdLTFB_TReAyhseAX-1N3j4WYWD0F02Uau0E/edit#
> >>>>>>>> On 31.03.20 06:28, godfrey he wrote:
> >>>>>>>>> Hi, Timo & Jark
> >>>>>>>>>
> >>>>>>>>> Thanks for your explanation.
> >>>>>>>>> Agree with you that async execution should always be async,
> >>>>>>>>> and sync execution scenario can be covered  by async execution.
> >>>>>>>>> It helps provide an unified entry point for batch and streaming.
> >>>>>>>>> I think we can also use sync execution for some testing.
> >>>>>>>>> So, I agree with you that we provide `executeSql` method and it's
> >>>>> async
> >>>>>>>>> method.
> >>>>>>>>> If we want sync method in the future, we can add method named
> >>>>>>>>> `executeSqlSync`.
> >>>>>>>>>
> >>>>>>>>> I think we've reached an agreement. I will update the document,
> and
> >>>>>>>>> start
> >>>>>>>>> voting process.
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Godfrey
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Jark Wu<[email protected]>  于2020年3月31日周二 上午12:46写道：
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I didn't follow the full discussion.
> >>>>>>>>>> But I share the same concern with Timo that streaming queries
> >> should
> >>>>>>>>>> always
> >>>>>>>>>> be async.
> >>>>>>>>>> Otherwise, I can image it will cause a lot of confusion and
> >> problems
> >>>>> if
> >>>>>>>>>> users don't deeply keep the "sync" in mind (e.g. client hangs).
> >>>>>>>>>> Besides, the streaming mode is still the majority use cases of
> >> Flink
> >>>>>>>>>> and
> >>>>>>>>>> Flink SQL. We should put the usability at a high priority.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Jark
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Mon, 30 Mar 2020 at 23:27, Timo Walther<[email protected]>
> >>>>> wrote:
> >>>>>>>>>>> Hi Godfrey,
> >>>>>>>>>>>
> >>>>>>>>>>> maybe I wasn't expressing my biggest concern enough in my last
> >> mail.
> >>>>>>>>>>> Even in a singleline and sync execution, I think that streaming
> >>>>>>>>>>> queries
> >>>>>>>>>>> should not block the execution. Otherwise it is not possible to
> >> call
> >>>>>>>>>>> collect() or print() on them afterwards.
> >>>>>>>>>>>
> >>>>>>>>>>> "there are too many things need to discuss for multiline":
> >>>>>>>>>>>
> >>>>>>>>>>> True, I don't want to solve all of them right now. But what I
> >> know
> >>>>> is
> >>>>>>>>>>> that our newly introduced methods should fit into a multiline
> >>>>>>>>>>> execution.
> >>>>>>>>>>> There is no big difference of calling `executeSql(A),
> >>>>>>>>>>> executeSql(B)` and
> >>>>>>>>>>> processing a multiline file `A;\nB;`.
> >>>>>>>>>>>
> >>>>>>>>>>> I think the example that you mentioned can simply be undefined
> >> for
> >>>>>>>>>>> now.
> >>>>>>>>>>> Currently, no catalog is modifying data but just metadata. This
> >> is a
> >>>>>>>>>>> separate discussion.
> >>>>>>>>>>>
> >>>>>>>>>>> "result of the second statement is indeterministic":
> >>>>>>>>>>>
> >>>>>>>>>>> Sure this is indeterministic. But this is the implementers
> fault
> >>>>>>>>>>> and we
> >>>>>>>>>>> cannot forbid such pipelines.
> >>>>>>>>>>>
> >>>>>>>>>>> How about we always execute streaming queries async? It would
> >>>>> unblock
> >>>>>>>>>>> executeSql() and multiline statements.
> >>>>>>>>>>>
> >>>>>>>>>>> Having a `executeSqlAsync()` is useful for batch. However, I
> >> don't
> >>>>>>>>>>> want
> >>>>>>>>>>> `sync/async` be the new batch/stream flag. The execution
> behavior
> >>>>>>>>>>> should
> >>>>>>>>>>> come from the query itself.
> >>>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>> Timo
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On 30.03.20 11:12, godfrey he wrote:
> >>>>>>>>>>>> Hi Timo,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Agree with you that streaming queries is our top priority,
> >>>>>>>>>>>> but I think there are too many things need to discuss for
> >> multiline
> >>>>>>>>>>>> statements:
> >>>>>>>>>>>> e.g.
> >>>>>>>>>>>> 1. what's the behaivor of DDL and DML mixing for async
> >> execution:
> >>>>>>>>>>>> create table t1 xxx;
> >>>>>>>>>>>> create table t2 xxx;
> >>>>>>>>>>>> insert into t2 select * from t1 where xxx;
> >>>>>>>>>>>> drop table t1; // t1 may be a MySQL table, the data will also
> be
> >>>>>>>>>> deleted.
> >>>>>>>>>>>> t1 is dropped when "insert" job is running.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 2. what's the behaivor of unified scenario for async
> execution:
> >>>>>>>>>>>> (as you
> >>>>>>>>>>>> mentioned)
> >>>>>>>>>>>> INSERT INTO t1 SELECT * FROM s;
> >>>>>>>>>>>> INSERT INTO t2 SELECT * FROM s JOIN t1 EMIT STREAM;
> >>>>>>>>>>>>
> >>>>>>>>>>>> The result of the second statement is indeterministic, because
> >> the
> >>>>>>>>>> first
> >>>>>>>>>>>> statement maybe is running.
> >>>>>>>>>>>> I think we need to put a lot of effort to define the behavior
> of
> >>>>>>>>>>> logically
> >>>>>>>>>>>> related queries.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In this FLIP, I suggest we only handle single statement, and
> we
> >>>>> also
> >>>>>>>>>>>> introduce an async execute method
> >>>>>>>>>>>> which is more important and more often used for users.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Dor the sync methods (like `TableEnvironment.executeSql` and
> >>>>>>>>>>>> `StatementSet.execute`),
> >>>>>>>>>>>> the result will be returned until the job is finished. The
> >>>>> following
> >>>>>>>>>>>> methods will be introduced in this FLIP:
> >>>>>>>>>>>>
> >>>>>>>>>>>>       /**
> >>>>>>>>>>>>        * Asynchronously execute the given single statement
> >>>>>>>>>>>>        */
> >>>>>>>>>>>> TableEnvironment.executeSqlAsync(String statement):
> TableResult
> >>>>>>>>>>>>
> >>>>>>>>>>>> /**
> >>>>>>>>>>>>       * Asynchronously execute the dml statements as a batch
> >>>>>>>>>>>>       */
> >>>>>>>>>>>> StatementSet.executeAsync(): TableResult
> >>>>>>>>>>>>
> >>>>>>>>>>>> public interface TableResult {
> >>>>>>>>>>>>         /**
> >>>>>>>>>>>>          * return JobClient for DQL and DML in async mode,
> else
> >>>>> return
> >>>>>>>>>>>> Optional.empty
> >>>>>>>>>>>>          */
> >>>>>>>>>>>>         Optional<JobClient> getJobClient();
> >>>>>>>>>>>> }
> >>>>>>>>>>>>
> >>>>>>>>>>>> what do you think?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Godfrey
> >>>>>>>>>>>>
> >>>>>>>>>>>> Timo Walther<[email protected]>  于2020年3月26日周四 下午9:15写道：
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Godfrey,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> executing streaming queries must be our top priority because
> >> this
> >>>>> is
> >>>>>>>>>>>>> what distinguishes Flink from competitors. If we change the
> >>>>>>>>>>>>> execution
> >>>>>>>>>>>>> behavior, we should think about the other cases as well to
> not
> >>>>> break
> >>>>>>>>>> the
> >>>>>>>>>>>>> API a third time.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I fear that just having an async execute method will not be
> >> enough
> >>>>>>>>>>>>> because users should be able to mix streaming and batch
> queries
> >>>>> in a
> >>>>>>>>>>>>> unified scenario.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If I remember it correctly, we had some discussions in the
> past
> >>>>>>>>>>>>> about
> >>>>>>>>>>>>> what decides about the execution mode of a query. Currently,
> we
> >>>>>>>>>>>>> would
> >>>>>>>>>>>>> like to let the query decide, not derive it from the sources.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> So I could image a multiline pipeline as:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> USE CATALOG 'mycat';
> >>>>>>>>>>>>> INSERT INTO t1 SELECT * FROM s;
> >>>>>>>>>>>>> INSERT INTO t2 SELECT * FROM s JOIN t1 EMIT STREAM;
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> For executeMultilineSql():
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> sync because regular SQL
> >>>>>>>>>>>>> sync because regular Batch SQL
> >>>>>>>>>>>>> async because Streaming SQL
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> For executeAsyncMultilineSql():
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> async because everything should be async
> >>>>>>>>>>>>> async because everything should be async
> >>>>>>>>>>>>> async because everything should be async
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> What we should not start for executeAsyncMultilineSql():
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> sync because DDL
> >>>>>>>>>>>>> async because everything should be async
> >>>>>>>>>>>>> async because everything should be async
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> What are you thoughts here?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> Timo
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 26.03.20 12:50, godfrey he wrote:
> >>>>>>>>>>>>>> Hi Timo,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I agree with you that streaming queries mostly need async
> >>>>>>>>>>>>>> execution.
> >>>>>>>>>>>>>> In fact, our original plan is only introducing sync methods
> in
> >>>>> this
> >>>>>>>>>>> FLIP,
> >>>>>>>>>>>>>> and async methods (like "executeSqlAsync") will be
> introduced
> >> in
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>> future
> >>>>>>>>>>>>>> which is mentioned in the appendix.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Maybe the async methods also need to be considered in this
> >> FLIP.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I think sync methods is also useful for streaming which can
> be
> >>>>> used
> >>>>>>>>>> to
> >>>>>>>>>>>>> run
> >>>>>>>>>>>>>> bounded source.
> >>>>>>>>>>>>>> Maybe we should check whether all sources are bounded in
> sync
> >>>>>>>>>> execution
> >>>>>>>>>>>>>> mode.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Also, if we block for streaming queries, we could never
> >> support
> >>>>>>>>>>>>>>> multiline files. Because the first INSERT INTO would block
> >> the
> >>>>>>>>>> further
> >>>>>>>>>>>>>>> execution.
> >>>>>>>>>>>>>> agree with you, we need async method to submit multiline
> >> files,
> >>>>>>>>>>>>>> and files should be limited that the DQL and DML should be
> >>>>>>>>>>>>>> always in
> >>>>>>>>>>> the
> >>>>>>>>>>>>>> end for streaming.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>> Godfrey
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Timo Walther<[email protected]>  于2020年3月26日周四 下午4:29写道：
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Godfrey,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> having control over the job after submission is a
> requirement
> >>>>> that
> >>>>>>>>>> was
> >>>>>>>>>>>>>>> requested frequently (some examples are [1], [2]). Users
> >> would
> >>>>>>>>>>>>>>> like
> >>>>>>>>>> to
> >>>>>>>>>>>>>>> get insights about the running or completed job. Including
> >> the
> >>>>>>>>>> jobId,
> >>>>>>>>>>>>>>> jobGraph etc., the JobClient summarizes these properties.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> It is good to have a discussion about
> >> synchronous/asynchronous
> >>>>>>>>>>>>>>> submission now to have a complete execution picture.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I thought we submit streaming queries mostly async and just
> >>>>>>>>>>>>>>> wait for
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>> successful submission. If we block for streaming queries,
> how
> >>>>>>>>>>>>>>> can we
> >>>>>>>>>>>>>>> collect() or print() results?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Also, if we block for streaming queries, we could never
> >> support
> >>>>>>>>>>>>>>> multiline files. Because the first INSERT INTO would block
> >> the
> >>>>>>>>>> further
> >>>>>>>>>>>>>>> execution.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> If we decide to block entirely on streaming queries, we
> need
> >> the
> >>>>>>>>>> async
> >>>>>>>>>>>>>>> execution methods in the design already. However, I would
> >>>>>>>>>>>>>>> rather go
> >>>>>>>>>>> for
> >>>>>>>>>>>>>>> non-blocking streaming queries. Also with the `EMIT STREAM`
> >> key
> >>>>>>>>>>>>>>> word
> >>>>>>>>>>> in
> >>>>>>>>>>>>>>> mind that we might add to SQL statements soon.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>> Timo
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> [1]https://issues.apache.org/jira/browse/FLINK-16761
> >>>>>>>>>>>>>>> [2]https://issues.apache.org/jira/browse/FLINK-12214
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 25.03.20 16:30, godfrey he wrote:
> >>>>>>>>>>>>>>>> Hi Timo,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks for the updating.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Regarding to "multiline statement support", I'm also fine
> >> that
> >>>>>>>>>>>>>>>> `TableEnvironment.executeSql()` only supports single line
> >>>>>>>>>> statement,
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>> can support multiline statement later (needs more
> discussion
> >>>>>>>>>>>>>>>> about
> >>>>>>>>>>>>> this).
> >>>>>>>>>>>>>>>> Regarding to "StatementSet.explian()", I don't have strong
> >>>>>>>>>>>>>>>> opinions
> >>>>>>>>>>>>> about
> >>>>>>>>>>>>>>>> that.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Regarding to "TableResult.getJobClient()", I think it's
> >>>>>>>>>> unnecessary.
> >>>>>>>>>>>>> The
> >>>>>>>>>>>>>>>> reason is: first, many statements (e.g. DDL, show xx, use
> >> xx)
> >>>>>>>>>>>>>>>> will
> >>>>>>>>>>> not
> >>>>>>>>>>>>>>>> submit a Flink job. second,
> `TableEnvironment.executeSql()`
> >> and
> >>>>>>>>>>>>>>>> `StatementSet.execute()` are synchronous method,
> >> `TableResult`
> >>>>>>>>>>>>>>>> will
> >>>>>>>>>>> be
> >>>>>>>>>>>>>>>> returned only after the job is finished or failed.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Regarding to "whether StatementSet.execute() needs to
> throw
> >>>>>>>>>>>>> exception", I
> >>>>>>>>>>>>>>>> think we should choose a unified way to tell whether the
> >>>>>>>>>>>>>>>> execution
> >>>>>>>>>> is
> >>>>>>>>>>>>>>>> successful. If `TableResult` contains ERROR kind
> >> (non-runtime
> >>>>>>>>>>>>> exception),
> >>>>>>>>>>>>>>>> users need to not only check the result but also catch the
> >>>>>>>>>>>>>>>> runtime
> >>>>>>>>>>>>>>>> exception in their code. or `StatementSet.execute()` does
> >> not
> >>>>>>>>>>>>>>>> throw
> >>>>>>>>>>> any
> >>>>>>>>>>>>>>>> exception (including runtime exception), all exception
> >>>>>>>>>>>>>>>> messages are
> >>>>>>>>>>> in
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> result.  I prefer "StatementSet.execute() needs to throw
> >>>>>>>>>> exception".
> >>>>>>>>>>> cc
> >>>>>>>>>>>>>>> @Jark
> >>>>>>>>>>>>>>>> Wu<[email protected]>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I will update the agreed parts to the document first.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>> Godfrey
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Timo Walther<[email protected]>  于2020年3月25日周三
> >>>>>>>>>>>>>>>> 下午6:51写道：
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Godfrey,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> thanks for starting the discussion on the mailing list.
> And
> >>>>>>>>>>>>>>>>> sorry
> >>>>>>>>>>>>> again
> >>>>>>>>>>>>>>>>> for the late reply to FLIP-84. I have updated the Google
> >> doc
> >>>>> one
> >>>>>>>>>>> more
> >>>>>>>>>>>>>>>>> time to incorporate the offline discussions.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>         From Dawid's and my view, it is fine to postpone
> the
> >>>>>>>>>>>>>>>>> multiline
> >>>>>>>>>>>>> support
> >>>>>>>>>>>>>>>>> to a separate method. This can be future work even though
> >> we
> >>>>>>>>>>>>>>>>> will
> >>>>>>>>>>> need
> >>>>>>>>>>>>>>>>> it rather soon.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> If there are no objections, I suggest to update the
> FLIP-84
> >>>>>>>>>>>>>>>>> again
> >>>>>>>>>>> and
> >>>>>>>>>>>>>>>>> have another voting process.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>> Timo
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On 25.03.20 11:17, godfrey he wrote:
> >>>>>>>>>>>>>>>>>> Hi community,
> >>>>>>>>>>>>>>>>>> Timo, Fabian and Dawid have some feedbacks about
> >> FLIP-84[1].
> >>>>>>>>>>>>>>>>>> The
> >>>>>>>>>>>>>>>>> feedbacks
> >>>>>>>>>>>>>>>>>> are all about new introduced methods. We had a
> discussion
> >>>>>>>>>>> yesterday,
> >>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> most of feedbacks have been agreed upon. Here is the
> >>>>>>>>>>>>>>>>>> conclusions:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> *1. about proposed methods in `TableEnvironment`:*
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> the original proposed methods:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> TableEnvironment.createDmlBatch(): DmlBatch
> >>>>>>>>>>>>>>>>>> TableEnvironment.executeStatement(String statement):
> >>>>>>>>>>>>>>>>>> ResultTable
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> the new proposed methods:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> // we should not use abbreviations in the API, and the
> >> term
> >>>>>>>>>> "Batch"
> >>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>> easily confused with batch/streaming processing
> >>>>>>>>>>>>>>>>>> TableEnvironment.createStatementSet(): StatementSet
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> // every method that takes SQL should have `Sql` in its
> >> name
> >>>>>>>>>>>>>>>>>> // supports multiline statement ???
> >>>>>>>>>>>>>>>>>> TableEnvironment.executeSql(String statement):
> TableResult
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> // new methods. supports explaining DQL and DML
> >>>>>>>>>>>>>>>>>> TableEnvironment.explainSql(String statement,
> >>>>> ExplainDetail...
> >>>>>>>>>>>>>>> details):
> >>>>>>>>>>>>>>>>>> String
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> *2. about proposed related classes:*
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> the original proposed classes:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> interface DmlBatch {
> >>>>>>>>>>>>>>>>>>             void addInsert(String insert);
> >>>>>>>>>>>>>>>>>>             void addInsert(String targetPath, Table
> table);
> >>>>>>>>>>>>>>>>>>             ResultTable execute() throws Exception ;
> >>>>>>>>>>>>>>>>>>             String explain(boolean extended);
> >>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> public interface ResultTable {
> >>>>>>>>>>>>>>>>>>             TableSchema getResultSchema();
> >>>>>>>>>>>>>>>>>>             Iterable<Row> getResultRows();
> >>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> the new proposed classes:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> interface StatementSet {
> >>>>>>>>>>>>>>>>>>             // every method that takes SQL should have
> >> `Sql` in
> >>>>>>>>>>>>>>>>>> its
> >>>>>>>>>>> name
> >>>>>>>>>>>>>>>>>>             // return StatementSet instance for fluent
> >>>>> programming
> >>>>>>>>>>>>>>>>>>             addInsertSql(String statement): StatementSet
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>             // return StatementSet instance for fluent
> >>>>> programming
> >>>>>>>>>>>>>>>>>>             addInsert(String tablePath, Table table):
> >>>>> StatementSet
> >>>>>>>>>>>>>>>>>>             // new method. support overwrite mode
> >>>>>>>>>>>>>>>>>>             addInsert(String tablePath, Table table,
> >> boolean
> >>>>>>>>>>> overwrite):
> >>>>>>>>>>>>>>>>>> StatementSet
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>             explain(): String
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>             // new method. supports adding more details
> >> for the
> >>>>>>>>>> result
> >>>>>>>>>>>>>>>>>>             explain(ExplainDetail... extraDetails):
> String
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>             // throw exception ???
> >>>>>>>>>>>>>>>>>>             execute(): TableResult
> >>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> interface TableResult {
> >>>>>>>>>>>>>>>>>>             getTableSchema(): TableSchema
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>             // avoid custom parsing of an "OK" row in
> >>>>> programming
> >>>>>>>>>>>>>>>>>>             getResultKind(): ResultKind
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>             // instead of `get` make it explicit that
> this
> >> is
> >>>>>>>>>>>>>>>>>> might
> >>>>>>>>>> be
> >>>>>>>>>>>>>>>>> triggering
> >>>>>>>>>>>>>>>>>> an expensive operation
> >>>>>>>>>>>>>>>>>>             collect(): Iterable<Row>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>             // for fluent programming
> >>>>>>>>>>>>>>>>>>             print(): Unit
> >>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> enum ResultKind {
> >>>>>>>>>>>>>>>>>>             SUCCESS, // for DDL, DCL and statements
> with a
> >>>>> simple
> >>>>>>>>>> "OK"
> >>>>>>>>>>>>>>>>>>             SUCCESS_WITH_CONTENT, // rows with important
> >>>>>>>>>>>>>>>>>> content are
> >>>>>>>>>>>>>>> available
> >>>>>>>>>>>>>>>>>> (DML, DQL)
> >>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> *3. new proposed methods in `Table`*
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> `Table.insertInto()` will be deprecated, and the
> following
> >>>>>>>>>> methods
> >>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>>> introduced:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Table.executeInsert(String tablePath): TableResult
> >>>>>>>>>>>>>>>>>> Table.executeInsert(String tablePath, boolean
> overwrite):
> >>>>>>>>>>> TableResult
> >>>>>>>>>>>>>>>>>> Table.explain(ExplainDetail... details): String
> >>>>>>>>>>>>>>>>>> Table.execute(): TableResult
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> There are two issues need further discussion, one is
> >> whether
> >>>>>>>>>>>>>>>>>> `TableEnvironment.executeSql(String statement):
> >> TableResult`
> >>>>>>>>>> needs
> >>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>> support multiline statement (or whether
> `TableEnvironment`
> >>>>>>>>>>>>>>>>>> needs
> >>>>>>>>>> to
> >>>>>>>>>>>>>>>>> support
> >>>>>>>>>>>>>>>>>> multiline statement), and another one is whether
> >>>>>>>>>>>>>>> `StatementSet.execute()`
> >>>>>>>>>>>>>>>>>> needs to throw exception.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> please refer to the feedback document [2] for the
> details.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Any suggestions are warmly welcomed!
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>>
> >>>>>
> >>
> https://wiki.apache.org/confluence/pages/viewpage.action?pageId=134745878
> >>>>>>>>>>>>>>>>>> [2]
> >>>>>>>>>>>>>>>>>>
> >>>>>
> >>
> https://docs.google.com/document/d/1ueLjQWRPdLTFB_TReAyhseAX-1N3j4WYWD0F02Uau0E/edit
> >>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>> Godfrey
> >>>>>>>>>>>>>>>>>>
> >>
> >>
> >
>
>

Re: [DISCUSS] FLIP-84 Feedback Summary

Reply via email to