jonded94 commented on issue #9423: URL: https://github.com/apache/arrow-rs/issues/9423#issuecomment-3946447608
> @jonded94 Cool! Is disco-parquet open source?

Unfortunately not, as I stated here already:

> Note that we ideally either want to make our library public, or contribute things back to arro3, if possible. This just unfortunately hasn't been on our agenda at all and we haven't found much time for doing so 😢

It's unclear to me when/whether we will get the necessary capacity to do this.

In any case, our internal library actually follows a slightly different philosophy than `arro3` (mostly driven by deadlines, not technical reasons): while `arro3` appears to aim for maximal modularity and minimalism, providing a fully `pyarrow`-less, pure-Rust experience, our library is more monolithic and, at least right now, strictly dependent on `pyarrow` (because it uses the `arrow-pyarrow` subcrate for the Arrow<->PyO3 interaction). Its strength currently lies more in interacting with Parquet files and datasets for random-sampling workflows [1] in as performant and resource-efficient a way as possible, all through a hopefully easy-to-use Python interface.

But this shouldn't be too hard to include in `arro3` as well; I think there was a PR open for a while that adds `ParquetFile` and `ParquetDataset` classes. One could easily add methods that read a single row group with an additional `RowSelection` filter and such, and datasets would require some `async` magic for performant multi-file metadata lookups on latency-heavy filesystems. After that, there isn't much additional magic left that our library provides right now.

[1] i.e. continuously decode just a few rows out of random row groups across dozens of different Parquet datasets in parallel

> I would just note that it is one community, the arrow community, I would be careful to focus on highlighting positives (of arrow-rs) vs highlighting negatives (of pyarrow).

Totally!
We want to keep the `pyarrow` comparisons as minimal as possible: we don't want to point fingers at friendly communities, and honestly we also haven't been great at properly reporting our issues (we've opened a bunch of GitHub issues, but the C++/Cython codebase is just too complex for us to help debug in any productive way). It's just that, at least for our niche use case, we have been more productive in reaching the absolute lowest resource consumption by writing a small wrapper library around `arrow-rs`, instead of trying to tune and understand how `pyarrow`'s memory pools and general Parquet file & dataset classes behave; with `arrow-rs` you seem to have stronger control over this.
