jorgecarleitao edited a comment on issue #1532: URL: https://github.com/apache/arrow-datafusion/issues/1532#issuecomment-1013347710
I would like to thank all of you who have been working on the PR, and also all of those who already gave it a spin. I am incredibly humbled and thankful for it.

@tusvold, thanks a lot for raising these concerns; they are much appreciated and valid. I agree with you that batch control is useful to avoid a very large memory footprint. I have added it [as an issue on arrow2](https://github.com/jorgecarleitao/arrow2/issues/768). Let me know if it captures your main point.

With regard to the encoders, I have struggled to find parquet writers that can produce such encodings, so that we have something to integration-test against when implementing them. I would be happy to add them to our CI and prove the correctness and completeness of the implementation (and fix what is missing). The process in arrow2 with regard to formats has been that we need at least one official implementation, or two non-official implementations, to confirm the correctness of arrow2's implementation.

Since a comparison was made, I think we should be fair and enumerate the advantages and disadvantages of each. For example, arrow-rs/parquet currently supports deeply nested parquet types, while arrow2 does not. DataFusion does not use them much, but there are important use cases where they appear. arrow-rs has a larger user base by crate downloads and is an official implementation. arrow-rs also has pyarrow support out of the box (for those using pyo3), while arrow2 does not.

On the other hand, arrow2:

* implements the complete Arrow specification (under the process mentioned above)
* has `async` support for all its IO except the Arrow stream reader
* passes all its MIRI tests
* is panic-free in its IO read paths, except parquet
* actively follows the recommendations of the Rust security WG and the portable SIMD WG
* has `forbid(unsafe_code)` IO
* has faster compute kernels
* has internals that are simpler to understand and use, leveraging strongly typed data structures

Now, under the argument that it is the same in-memory format after all and what matters is feature completeness, I could argue that we should then create a thin FFI wrapper for the C++ implementation in Rust, abandon the Rust implementations altogether, and all contribute to the official C++.

Which brings me to my main point: imo this is not about which implementation has the fastest parquet reader or writer. It is about which code base has the most solid foundations for all of us to develop the next generation of columnar-based query engines, web applications leveraging Arrow's forte, cool web-based streaming apps leveraging Arrow and Tokio, distributed query engines on AWS lambdas, etc., on a programming paradigm centred around correctness, ease of use, and performance.

The fact that DataFusion has never passed MIRI checks, and that it has been like this since its inception, shows that these are not simple issues to fix, nor are arrow's internals sufficiently appealing for the community to fix them (at least for the unpaid ones like myself). Furthermore:

* despite Andrew's amazing effort in validating input data to arrays, not all arrow-rs tests pass MIRI yet (arrow2 passes, including roundtrips on all its IO and FFI components)
* the parquet crate is not tested under MIRI (both parquet2 and arrow2's IO are `#[forbid(unsafe_code)]`)

With that said, is there a conclusion that the root cause of the high memory usage is not batching parquet column chunks into smaller arrays, or is it a hypothesis that we still need to test? Is that the primary concern here, and is it sufficiently important to block adoption?
If yes, I would gladly work towards addressing it upstream.
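For context, here is a minimal sketch (not DataFusion's actual code) of the kind of batched reading the arrow2 issue linked above is about, using the arrow-rs `parquet` crate's `ArrowReader` API as it existed around the time of this discussion (newer releases expose a builder-based API instead); the file name and the batch size of 2048 are placeholders:

```rust
use std::fs::File;
use std::sync::Arc;

use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "data.parquet" and the batch size of 2048 are placeholders.
    let file = File::open("data.parquet")?;
    let file_reader = Arc::new(SerializedFileReader::new(file)?);
    let mut arrow_reader = ParquetFileArrowReader::new(file_reader);

    // Each iteration yields a RecordBatch of at most 2048 rows, so a whole
    // row group never has to be materialized as a single large array.
    for batch in arrow_reader.get_record_reader(2048)? {
        let batch = batch?;
        println!("read {} rows", batch.num_rows());
    }
    Ok(())
}
```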
