jonded94 commented on issue #9423:
URL: https://github.com/apache/arrow-rs/issues/9423#issuecomment-3946447608

   > @jonded94 Cool! Is disco-parquet open source?
   
   Unfortunately not, as I stated here already:
   
   > Note that we ideally either want to make our library public, or contribute 
things back to arro3, if possible. This just unfortunately hasn't been on our 
agenda at all and we haven't found much time for doing so 😢
   
   It's unclear to me when or whether we will get the necessary capacity to do 
this. In any case, our internal library actually has a slightly different 
philosophy than `arro3` (mostly driven by deadlines, not technical reasons): 
whilst `arro3` appears to aim for maximum modularity and minimalism in order to 
provide a fully `pyarrow`-less, pure Rust-based experience, our library is a bit 
more monolithic and, at least right now, strictly dependent on `pyarrow` (because 
it uses the `arrow-pyarrow` subcrate for the Arrow<->PyO3 interaction). 
   Its strength right now lies more in interacting with parquet files and 
datasets for random sampling workflows [1] as performantly and 
resource-efficiently as possible, all through a hopefully easy-to-use Python 
interface. But this shouldn't be too hard to include in `arro3` too, as I think 
there has been a PR open for a while that adds `ParquetFile` and 
`ParquetDataset` classes. One could easily add methods that read a single row 
group with an additional `RowSelection` filter and the like. Datasets would 
require some `async` magic for performant multi-file metadata lookups on 
high-latency filesystems, and beyond that there isn't much additional magic 
that our library provides right now.
   
   [1] (i.e. continuously decoding just a few rows out of random row groups 
across dozens of different parquet datasets in parallel)
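
   To make the sampling idea in [1] a bit more concrete, here is a minimal, 
pure-Python sketch of the bookkeeping only. It is not code from our library: 
`rows_to_selection` and `sample_row_group` are hypothetical helper names, and 
nothing is actually read from disk. The skip/select runs are meant to mirror 
how a `RowSelection` in the Rust `parquet` crate is built from `RowSelector` 
entries, which one would then pass to the reader builder together with the 
chosen row group index.

```python
import random
from typing import List, Tuple

Run = Tuple[str, int]  # ("skip" | "select", number_of_rows)


def rows_to_selection(rows: List[int], num_rows: int) -> List[Run]:
    """Turn row indices into alternating skip/select runs covering num_rows.

    Hypothetical helper for illustration; it only mimics the shape of a
    RowSelection (a list of skip/select runs) and is not a library API.
    """
    runs: List[Run] = []
    cursor = 0  # first row not yet covered by any run
    for row in sorted(set(rows)):
        if not 0 <= row < num_rows:
            raise IndexError(f"row {row} outside row group of {num_rows} rows")
        gap = row - cursor
        if gap > 0:
            runs.append(("skip", gap))
        if gap == 0 and runs and runs[-1][0] == "select":
            # Adjacent to the previous selected row: extend that run.
            runs[-1] = ("select", runs[-1][1] + 1)
        else:
            runs.append(("select", 1))
        cursor = row + 1
    if cursor < num_rows:
        # Explicitly skip the remainder so the runs cover the whole row group.
        runs.append(("skip", num_rows - cursor))
    return runs


def sample_row_group(
    row_group_sizes: List[int], k: int, rng: random.Random
) -> Tuple[int, List[Run]]:
    """Pick one random row group and up to k random rows inside it."""
    group = rng.randrange(len(row_group_sizes))
    size = row_group_sizes[group]
    rows = rng.sample(range(size), min(k, size))
    return group, rows_to_selection(rows, size)


if __name__ == "__main__":
    # Row group sizes are made-up numbers; normally they come from file metadata.
    group, selection = sample_row_group([1000, 1000, 500], k=4, rng=random.Random())
    print(group, selection)
```

   In Rust, the resulting runs would roughly correspond to `RowSelector::skip(n)` 
and `RowSelector::select(n)` entries, applied via `with_row_groups` and 
`with_row_selection` on the parquet reader builder, so that only the pages 
containing the selected rows need to be decoded.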
   
   >  I would just note that it is one community, the arrow community, I would 
be careful to focus on highlighting positives (of arrow-rs) vs highlighting 
negatives (of pyarrow).
   
   Totally! We want to keep the `pyarrow` comparisons as minimal as absolutely 
possible, as we don't want to point fingers at friendly communities, and 
honestly we also haven't been too good at properly reporting our issues (we've 
opened a bunch of GitHub issues, but the C++/Cython codebase is just too 
complex for us to help with debugging in any productive way). It's just that, at 
least for our niche use case, we have been more productive at reaching the 
absolute lowest resource consumption by just writing a small wrapper library 
around `arrow-rs`, instead of trying to tune and understand how `pyarrow`'s 
memory pools and general parquet file & dataset classes behave, as with 
`arrow-rs` you seem to have somewhat stronger control over this.
   


