[Discuss] Add support for Apache Arrow

Run Thu, 11 Apr 2019 05:22:14 -0700

Hi guys,


Apache Arrow provides a cross-language, standardized, columnar, memory format 
for data. 
So it is highly desirable to import Arrow to Flink, and make use of its memory 
layout and memory management facilities.
More background on this can be found in 
https://issues.apache.org/jira/browse/FLINK-10929


The problem is that, incorporating Arrow to Flink is a big move, and may affect 
many modules of Flink.
In our efforts to vectorize Blink batch jobs, we have tried to incorporate 
Arrow to Flink. Those were incremental changes, which can be easily turned off 
with a single flag, and the changes are transparent to other parts of the code 
base.


This is the draft of our design document: 
https://docs.google.com/document/d/1LstaiGzlzTdGUmyG_-9GSWleKpLRid8SJWXQCTqcQB4/edit?usp=sharing


For the first step, I suggest providing a flag which is disabled by default, 
and let the MemoryManager depend on the Arrow Buffer Allocator. With this 
change, all the MemorySegment will be based on Arrow buffers, but this is 
transparent to other components, and should never break them.


After this step, we can apply the changes described in the design documents, 
incrementally.
This is our initial thoughts. Would you please give your valuable comments?


Best,
Liya Fan

[Discuss] Add support for Apache Arrow

Reply via email to