[ https://issues.apache.org/jira/browse/FLINK-10929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831530#comment-16831530 ]
Stephan Ewen commented on FLINK-10929:
--------------------------------------
[~fan_li_ya]
A lot of this seems applicable to the early stages of pulling data in and
filtering / projecting it. So in combination with sources I can see a benefit,
but that is different from a deep integration into the query processor.
For a deep integration into the query processor, the answer is not clear. As
Kurt pointed out, there are open questions, specifically:
- Batch / Streaming unification
- High fan-out shuffles (which work better row-wise)
- Row materialization strategies for join cascades, which are non-trivial
Not only would this be a huge engineering effort before it reaches full feature
coverage; with all these open questions, I also don't see how we could possibly
start working on such an integration into the runtime. It would mean breaking
everything without knowing whether it would work out in the end.
I see only one way forward for Arrow in the query processor:
- First, we finish the current query processor work, based on the Blink merge,
and all the related work around batch failover.
- Second, someone makes a PoC by forking the Blink query processor and modifying
it to work with Arrow. That PoC would need to show that the same feature
coverage is achievable (there is no fundamental design issue / blocker), that
there are relevant speedups in many cases, and that there are no bad regressions
(in performance, stability, resource footprint) in too many other cases. That
means having a solid design or PoC implementation for all complex cases like
(inner-/outer-/semi-/anti-) joins, time-versioned joins (in memory and
spilling), aggregations, high fan-out shuffles, etc.
- We could then make this additional query processor available through the
pluggable Query Planner mechanism, in the same way as the current Flink SQL
engine and the Blink SQL engine exist side by side for now (see the sketch
below).
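For reference, this is roughly what the side-by-side planner selection looks
like through the unified TableEnvironment API (Flink 1.9+); a hypothetical
Arrow-based processor would be exposed through the same switch. This is only a
sketch of the selection mechanism, not a proposal for a concrete API:
{code:java}
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PlannerSelectionSketch {
    public static void main(String[] args) {
        // Current Flink SQL engine (legacy planner), selected explicitly.
        EnvironmentSettings legacy = EnvironmentSettings.newInstance()
                .useOldPlanner()
                .inStreamingMode()
                .build();

        // Blink SQL engine, available side by side through the same mechanism.
        EnvironmentSettings blink = EnvironmentSettings.newInstance()
                .useBlinkPlanner()
                .inStreamingMode()
                .build();

        // Swapping the settings object swaps the whole query processor;
        // an Arrow-based processor would plug in the same way.
        TableEnvironment tEnv = TableEnvironment.create(blink);
        System.out.println("Planner environment: " + tEnv.getClass().getName());
    }
}
{code}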
> Add support for Apache Arrow
> ----------------------------
>
> Key: FLINK-10929
> URL: https://issues.apache.org/jira/browse/FLINK-10929
> Project: Flink
> Issue Type: Wish
> Components: Runtime / State Backends
> Reporter: Pedro Cardoso Silva
> Priority: Minor
> Attachments: image-2019-04-10-13-43-08-107.png
>
>
> Investigate the possibility of adding support for Apache Arrow as a
> standardized, columnar in-memory format for data.
> Given the activity that [https://github.com/apache/arrow] is currently
> getting and its claimed objective of providing a zero-copy, standardized data
> format across platforms, I think it makes sense for Flink to look into
> supporting it.