[GitHub] [arrow-datafusion] tustvold commented on issue #1532: Discussion: Switch DataFusion to using arrow2?

GitBox Tue, 11 Jan 2022 03:25:51 -0800


tustvold commented on issue #1532:
URL: 
https://github.com/apache/arrow-datafusion/issues/1532#issuecomment-1009868992



   > Will arrow-rs eventually support async file IO? Requiring a synchronous 
ChunkReader is currently a major limitation in supporting alternate ObjectStores
   
   FWIW it would be relatively straightforward to support async IO within 
arrow-rs. You need buffered fetching to get reasonable IO performance anyway, 
so you can do an async fetch into a buffer and then use the sync decoders to 
decode it. I believe this is what arrow2 is doing as well? I quickly cobbled 
together a proof of concept showing how this can be done with parquet 
[here](https://github.com/apache/arrow-rs/pull/1154).
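
   To make the "async fetch into a buffer, then decode synchronously" idea concrete, here is a minimal std-only sketch. It is not the code from the linked PR and does not use any arrow-rs or parquet API: `InMemoryStore`, `fetch`, `decode_u32s`, `read_values` and `block_on` are all hypothetical names invented for illustration, and the "file format" is just a run of little-endian `u32` values.

```rust
use std::future::Future;
use std::io::{self, Cursor, Read};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for an async object store (not a real arrow-rs type).
struct InMemoryStore {
    data: Vec<u8>,
}

impl InMemoryStore {
    /// Asynchronously fetch `len` bytes starting at `offset` into an owned buffer.
    /// A real store would await network or disk IO here.
    async fn fetch(&self, offset: usize, len: usize) -> io::Result<Vec<u8>> {
        let end = offset
            .checked_add(len)
            .filter(|end| *end <= self.data.len())
            .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "range out of bounds"))?;
        Ok(self.data[offset..end].to_vec())
    }
}

/// Purely synchronous decoder: reads `count` little-endian u32 values from any `Read`.
/// It knows nothing about async; it only sees an in-memory buffer.
fn decode_u32s<R: Read>(mut reader: R, count: usize) -> io::Result<Vec<u32>> {
    let mut out = Vec::with_capacity(count);
    let mut buf = [0u8; 4];
    for _ in 0..count {
        reader.read_exact(&mut buf)?;
        out.push(u32::from_le_bytes(buf));
    }
    Ok(out)
}

/// The async part is only the fetch; decoding reuses the sync decoder unchanged.
async fn read_values(store: &InMemoryStore, offset: usize, count: usize) -> io::Result<Vec<u32>> {
    let bytes = store.fetch(offset, count * 4).await?;
    decode_u32s(Cursor::new(bytes), count)
}

/// Waker that does nothing; enough for the busy-polling executor below.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Tiny busy-polling executor so the sketch needs no external crates.
/// In practice you would use a real runtime instead (assumption: e.g. tokio).
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
        std::thread::yield_now();
    }
}
```

   Calling `block_on(read_values(&store, 4, 2))` on a store holding the bytes for `1, 2, 3` would fetch the last eight bytes asynchronously and hand them to the sync decoder. The point of the split is that the decoder never needs to change: only the code that fills the buffer has to be async.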
   
   FWIW I have some optimisations to the arrow-rs parquet reader in flight that 
yield some pretty significant speedups 
(https://github.com/apache/arrow-rs/pull/1054, 
https://github.com/apache/arrow-rs/pull/1082). I am also planning to work on 
dictionary preservation next, which should yield orders of magnitude speedups 
for string dictionaries.
   
   I would _personally_ prefer an approach that sees the great work on arrow2 
cherry-picked into arrow-rs, with `arrow2` serving as an incubator for new 
ideas. I am happy to help out with this if there are things people would 
particularly like to see ported across. The current ecosystem fragmentation is 
just unfortunate for both users and contributors imo...


