[ https://issues.apache.org/jira/browse/FLINK-10929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831530#comment-16831530 ]
Stephan Ewen commented on FLINK-10929:
--------------------------------------
[~fan_li_ya]
A lot of this seems applicable to the early stages of pulling data in and
filtering / projecting it. So in combination with sources I can see a benefit,
but that is different from a deep integration into the query processor.
For a deep integration into the query processor, the answer is not clear. As
Kurt pointed out, there are open questions, specifically:
- Batch / Streaming unification
- High fan-out shuffles (which work better row-wise)
- Row materialization strategies for join cascades, which are non-trivial
Not only would this be a huge engineering effort before it reaches full feature
coverage; with all these open questions, I also don't see how we could possibly
start working on such an integration into the runtime. It would mean breaking
everything without knowing whether it would work out in the end.
I see only one way forward for Arrow in the query processor:
- First, we finish the current query processor work, based on the Blink merge,
and all the related work around batch failover.
- Second, someone makes a PoC by forking the Blink query processor and modifying
it to work with Arrow. That PoC would need to show that the same feature
coverage is achievable (there is no fundamental design issue / blocker), that
there are relevant speedups in many cases, and that there are no bad regressions
(in performance, stability, resource footprint) in too many other cases. That
means having a solid design or PoC implementation for all complex cases like
(inner-/outer-/semi-/anti-) joins, time-versioned joins (in memory and
spilling), aggregations, high fan-out shuffles, etc.
- We could then make this additional query processor available through the
pluggable Query Planner mechanism, in the same way as the current Flink SQL
engine and the Blink SQL engine exist side by side for now (see the sketch
below).
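For reference, this is roughly what the side-by-side planner selection looks
like through the unified TableEnvironment API (Flink 1.9+); a hypothetical
Arrow-based processor would be exposed through the same switch. This is only a
sketch of the selection mechanism, not a proposal for a concrete API:
{code:java}
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PlannerSelectionSketch {
    public static void main(String[] args) {
        // Current Flink SQL engine (legacy planner), selected explicitly.
        EnvironmentSettings legacy = EnvironmentSettings.newInstance()
                .useOldPlanner()
                .inStreamingMode()
                .build();

        // Blink SQL engine, available side by side through the same mechanism.
        EnvironmentSettings blink = EnvironmentSettings.newInstance()
                .useBlinkPlanner()
                .inStreamingMode()
                .build();

        // Swapping the settings object swaps the whole query processor;
        // an Arrow-based processor would plug in the same way.
        TableEnvironment tEnv = TableEnvironment.create(blink);
        System.out.println("Planner environment: " + tEnv.getClass().getName());
    }
}
{code}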
> Add support for Apache Arrow
> ----------------------------
>
> Key: FLINK-10929
> URL: https://issues.apache.org/jira/browse/FLINK-10929
> Project: Flink
> Issue Type: Wish
> Components: Runtime / State Backends
> Reporter: Pedro Cardoso Silva
> Priority: Minor
> Attachments: image-2019-04-10-13-43-08-107.png
>
>
> Investigate the possibility of adding support for Apache Arrow as a
> standardized, columnar in-memory format for data.
> Given the activity that [https://github.com/apache/arrow] is currently
> getting and its claimed objective of providing a zero-copy, standardized data
> format across platforms, I think it makes sense for Flink to look into
> supporting it.