jorgecarleitao edited a comment on issue #1532: URL: https://github.com/apache/arrow-datafusion/issues/1532#issuecomment-1013347710
I would like to thank all of you who have been working on the PR, and also all of those who already gave it a spin. I am incredibly humbled and thankful for it.

@tusvold, thanks a lot for raising these concerns; they are much appreciated and valid. I agree with you that batch control is useful to avoid a very large memory footprint. I have added it [as an issue on arrow2](https://github.com/jorgecarleitao/arrow2/issues/768). Let me know if it captures your main point.

With regard to the encoders, I have struggled to find parquet writers that can produce such encodings, so that we have something to integration-test against when implementing them. I would be happy to add them to our CI and prove the correctness and completeness of the implementation (and fix what is missing). The process in arrow2 with regard to formats has been that we need at least one official implementation, or two non-official implementations, to confirm the correctness of arrow2's implementation.

Since a comparison was made, I think we should be fair and enumerate the advantages and disadvantages of each. For example, arrow-rs/parquet currently supports deeply nested parquet types, while arrow2 does not. DataFusion does not use them much, but there are important use cases where they appear. arrow-rs has a larger user base by crate downloads and is an official implementation. arrow-rs also has pyarrow support out of the box (for those using pyo3), while arrow2 does not.

On the other hand, arrow2:

* implements the complete Arrow specification (under the process mentioned above)
* has `async` support for all its IO except the Arrow stream reader
* passes all its MIRI tests
* is panic-free in its IO read paths, except parquet
* actively follows the recommendations of the Rust security WG and the portable SIMD WG
* has `forbid(unsafe_code)` IO
* has faster compute kernels
* has internals that are simpler to understand and use, leveraging strongly typed data structures

Now, under the argument that it is the same in-memory format after all and what matters is feature completeness, I could argue that we should then create a thin FFI wrapper for the C++ implementation in Rust, abandon the Rust implementations altogether, and all contribute to the official C++.

Which brings me to my main point: imo this is not about which implementation has the fastest parquet reader or writer. It is about which code base has the most solid foundations for all of us to develop the next generation of columnar-based query engines, web applications leveraging Arrow's forte, cool web-based streaming apps leveraging Arrow and Tokio, distributed query engines on AWS lambdas, etc., on a programming paradigm centred around correctness, ease of use, and performance.

The fact that DataFusion has never passed MIRI checks, and that it has been like this since its inception, shows that these are not simple issues to fix, nor are arrow's internals sufficiently appealing for the community to fix them (at least for the unpaid ones like myself). Furthermore:

* despite Andrew's amazing effort in validating input data to arrays, not all arrow-rs tests pass MIRI yet (arrow2 passes, including roundtrips on all its IO and FFI components)
* the parquet crate is not tested under MIRI (both parquet2 and arrow2's IO are `#[forbid(unsafe_code)]`)

With that said, is there a conclusion that the root cause of the high memory usage is not batching parquet column chunks into smaller arrays, or is it a hypothesis that we still need to test? Is that the primary concern here, and is it sufficiently important to block adoption?
If yes, I would gladly work towards addressing it upstream.
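For context, here is a minimal sketch (not DataFusion's actual code) of the kind of batched reading the arrow2 issue linked above is about, using the arrow-rs `parquet` crate's `ArrowReader` API as it existed around the time of this discussion (newer releases expose a builder-based API instead); the file name and the batch size of 2048 are placeholders:

```rust
use std::fs::File;
use std::sync::Arc;

use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "data.parquet" and the batch size of 2048 are placeholders.
    let file = File::open("data.parquet")?;
    let file_reader = Arc::new(SerializedFileReader::new(file)?);
    let mut arrow_reader = ParquetFileArrowReader::new(file_reader);

    // Each iteration yields a RecordBatch of at most 2048 rows, so a whole
    // row group never has to be materialized as a single large array.
    for batch in arrow_reader.get_record_reader(2048)? {
        let batch = batch?;
        println!("read {} rows", batch.num_rows());
    }
    Ok(())
}
```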
