[
https://issues.apache.org/jira/browse/FLINK-14807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052101#comment-17052101
]
Stephan Ewen commented on FLINK-14807:
--------------------------------------
Nice to see this lively discussion. I agree that we should come up with a
proper solution and work backwards from there.
It would also be great to find a way that works across batch and streaming. Not
sure how easy that is, but maybe we can think along those lines.
About the ideas proposed so far:
* Concerning exactly-once semantics: That would be nice, but it would
inherently introduce a "checkpoint interval delay", because we can only feed
back completed results. Maybe that is okay in the streaming world; for batch
there is anyway only the "end of job" state.
* Concerning master-failover: I think we can get away with not supporting that
in the first version, as long as we point out that limitation.
* Concerning task-failover: That sounds like expected functionality to me.
* Server sockets as the basic data exchange plane are tricky: lost connections,
retries, etc. There needs to be a more sophisticated protocol on top of that in
any case.
* The GlobalAggregateManager is itself a pretty bad hack at the moment,
which I would hope to replace with the {{OperatorCoordinators}} in the
future.
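To illustrate the "checkpoint interval delay" from the exactly-once point above: results would be buffered per in-flight checkpoint and only released to the client once that checkpoint commits, so the client never observes rows that could still be rolled back. A minimal sketch (all class and method names here are hypothetical, not actual Flink API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch of exactly-once result visibility: emitted rows are
 * buffered under the current checkpoint id and only become visible to the
 * client when that checkpoint completes. Uncommitted buffers are dropped
 * on failure, which is what rolls the results back.
 */
public class CheckpointedResultBuffer {

    private final Map<Long, List<String>> pending = new HashMap<>();
    private final List<String> visible = new ArrayList<>();
    private long currentCheckpointId = 1L;

    /** Operator side: emit a row; it is not yet visible to the client. */
    void emit(String row) {
        pending.computeIfAbsent(currentCheckpointId, k -> new ArrayList<>()).add(row);
    }

    /** Checkpoint barrier: subsequent rows buffer under the new checkpoint id. */
    void startCheckpoint(long checkpointId) {
        currentCheckpointId = checkpointId;
    }

    /** Checkpoint commit: feed back only the completed results. */
    void notifyCheckpointComplete(long checkpointId) {
        List<String> done = pending.remove(checkpointId);
        if (done != null) {
            visible.addAll(done);
        }
    }

    /** Failure / abort: drop everything that was never committed. */
    void notifyCheckpointAborted(long checkpointId) {
        pending.remove(checkpointId);
    }

    /** Client side: only committed rows are ever returned. */
    List<String> fetchVisible() {
        return new ArrayList<>(visible);
    }

    public static void main(String[] args) {
        CheckpointedResultBuffer buf = new CheckpointedResultBuffer();
        buf.emit("row-1");                 // buffered under checkpoint 1
        buf.startCheckpoint(2);
        buf.emit("row-2");                 // buffered under checkpoint 2
        buf.notifyCheckpointComplete(1);   // "row-1" becomes visible
        buf.notifyCheckpointAborted(2);    // "row-2" is rolled back
        System.out.println(buf.fetchVisible());
    }
}
```

The delay is inherent: a row emitted right after a checkpoint barrier stays invisible for up to a full checkpoint interval before the next commit releases it.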
One thing we could try to do is enhance the accumulators with blob-server
offload.
* This would be similar to how the RPC service works. Small accumulator results
go directly in an RPC message; larger (aggregated) results may be offloaded to
the blob storage.
* We would send new accumulators with the checkpoint (or checkpoint commit),
giving exactly once semantics.
* Clients would pull the accumulators; the REST server would need to pull them
from blob storage if necessary.
* That might work, but it is a lot of indirection / out-of-band transfer as
soon as results get larger. Which might be expected to some extent, to keep the
memory footprint of the processes small.
> Add Table#collect api for fetching data to client
> -------------------------------------------------
>
> Key: FLINK-14807
> URL: https://issues.apache.org/jira/browse/FLINK-14807
> Project: Flink
> Issue Type: New Feature
> Components: Table SQL / API
> Affects Versions: 1.9.1
> Reporter: Jeff Zhang
> Priority: Major
> Labels: usability
> Fix For: 1.11.0
>
> Attachments: table-collect-draft.patch, table-collect.png
>
>
> Currently, it is very inconvenient for users to fetch data of a Flink job
> unless they specify a sink explicitly and then fetch data from this sink via
> its api (e.g. write to an HDFS sink, then read data from HDFS). However, most
> of the time users just want to get the data and do whatever processing they
> want. So it is very necessary for Flink to provide an api Table#collect for
> this purpose.
>
> Other apis such as Table#head and Table#print would also be helpful.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)