[GitHub] [arrow-datafusion] tustvold commented on issue #1532: Discussion: Switch DataFusion to using arrow2?

GitBox Fri, 14 Jan 2022 02:13:13 -0800


tustvold commented on issue #1532:
URL: https://github.com/apache/arrow-datafusion/issues/1532#issuecomment-1012985001



   > That's why I think it would be good if we can come up with a way to avoid 
cherry-picking commits from arrow2 into arrow-rs
   
   Sorry, I meant cherry-picking ideas rather than actual implementation. As in 
you might copy across `arrow2`'s `Buffer` implementation, add a conversion to 
`arrow-rs`'s `Buffer` implementation and then migrate the array implementations 
across one-by-one. Or do something similar for `MutableBuffer`. Ultimately the 
in-memory format is the same arrow spec, just wrapped up in different 
ways. The whole point of arrow is that conversion between the two 
representations should be cheap :smile:. 
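
To make the incremental idea concrete, here is a minimal sketch of what a cheap conversion between two buffer wrappers could look like. `LegacyBuffer` and `NewBuffer` are hypothetical stand-ins invented for illustration; they are not the real `arrow-rs` or `arrow2` `Buffer` types, and the real types have different internals. The only assumption is that both wrap the same reference-counted allocation plus an offset and length, so converting moves a pointer rather than copying bytes.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the existing buffer type
/// (NOT the real arrow-rs `Buffer`).
#[derive(Debug, Clone)]
pub struct LegacyBuffer {
    data: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

/// Hypothetical stand-in for the incoming buffer type
/// (NOT the real arrow2 `Buffer`).
#[derive(Debug, Clone)]
pub struct NewBuffer {
    data: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl LegacyBuffer {
    /// Takes ownership of `v` without copying its contents.
    pub fn from_vec(v: Vec<u8>) -> Self {
        let len = v.len();
        Self { data: Arc::new(v), offset: 0, len }
    }

    /// Zero-copy view of `len` bytes starting at `offset` within this buffer.
    pub fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        Self { data: Arc::clone(&self.data), offset: self.offset + offset, len }
    }

    pub fn as_slice(&self) -> &[u8] {
        &self.data[self.offset..self.offset + self.len]
    }
}

impl NewBuffer {
    pub fn as_slice(&self) -> &[u8] {
        &self.data[self.offset..self.offset + self.len]
    }
}

/// Conversion only moves the `Arc` and the two integers; no bytes are copied.
impl From<LegacyBuffer> for NewBuffer {
    fn from(b: LegacyBuffer) -> Self {
        NewBuffer { data: b.data, offset: b.offset, len: b.len }
    }
}

/// The reverse direction, so code can be migrated one array type at a time.
impl From<NewBuffer> for LegacyBuffer {
    fn from(b: NewBuffer) -> Self {
        LegacyBuffer { data: b.data, offset: b.offset, len: b.len }
    }
}

/// Returns true if both buffers point at the same allocation.
pub fn shares_allocation(a: &LegacyBuffer, b: &NewBuffer) -> bool {
    Arc::ptr_eq(&a.data, &b.data)
}
```

With conversions in both directions, an array implementation that has already moved to the new type can still hand its data to code that expects the old one, which is what would let the migration proceed piece by piece.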
   
   I guess I've just had bad past experiences of changing all the things at 
once :laughing:. I looked at the `arrow2` parquet implementation, since parquet 
is the part of the `arrow-rs` codebase I'm most familiar with, and there is a 
fair amount of non-trivial functionality loss compared to `arrow-rs`. Some of 
it is esoteric things like nested structures, but there are also larger 
omissions like certain page encodings or batch size control<sup>1.</sup> 
(it appears to read entire row groups into a single `RecordBatch`??). 
   
   This is unlikely to be a strictly additive change, and I'm having a very 
hard time getting my head around all of its implications. That's all I really 
care about, that we can communicate something more than "everything may or may 
not be broken" :laughing:  
   
   _<sup>1.</sup> FWIW this is the thing that makes reading parquet tricky, as 
pages don't delimit rows across columns or even semantic records within a 
column. If you just read whole row groups, it will be simple and fast, but 
the recommendation is for row groups on the order of 1GB compressed :sweat_smile:_
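
As a rough illustration of why batch size control needs bookkeeping, here is a minimal sketch of reading fixed-size batches from columns whose page boundaries do not line up. `ColumnReader` and `read_batches` are hypothetical names invented for this example and are not part of `arrow-rs`, `arrow2` or any parquet crate. The sketch assumes flat, non-null `i64` columns where one value equals one row; nested data, where a single record can span many values, is exactly the harder case and is not handled here.

```rust
/// Hypothetical flat column: one value per row, split into pages of
/// arbitrary and column-specific sizes.
pub struct ColumnReader {
    pages: Vec<Vec<i64>>,
    page_idx: usize, // page currently being consumed
    offset: usize,   // next unread value within that page
}

impl ColumnReader {
    pub fn new(pages: Vec<Vec<i64>>) -> Self {
        Self { pages, page_idx: 0, offset: 0 }
    }

    /// Reads up to `batch_size` values, carrying a partially consumed page
    /// over to the next call instead of stopping at page boundaries.
    pub fn read_batch(&mut self, batch_size: usize) -> Vec<i64> {
        let mut out = Vec::with_capacity(batch_size);
        while out.len() < batch_size && self.page_idx < self.pages.len() {
            let page = &self.pages[self.page_idx];
            let wanted = batch_size - out.len();
            let take = wanted.min(page.len() - self.offset);
            out.extend_from_slice(&page[self.offset..self.offset + take]);
            self.offset += take;
            if self.offset == page.len() {
                // Page exhausted: move on to the next one.
                self.page_idx += 1;
                self.offset = 0;
            }
        }
        out
    }
}

/// Reads two columns in lockstep, producing row-aligned batches even though
/// their pages are split at different rows.
pub fn read_batches(
    a: &mut ColumnReader,
    b: &mut ColumnReader,
    batch_size: usize,
) -> Vec<(Vec<i64>, Vec<i64>)> {
    assert!(batch_size > 0, "batch_size must be positive");
    let mut batches = Vec::new();
    loop {
        let col_a = a.read_batch(batch_size);
        let col_b = b.read_batch(batch_size);
        assert_eq!(col_a.len(), col_b.len(), "columns must have equal row counts");
        if col_a.is_empty() {
            break;
        }
        batches.push((col_a, col_b));
    }
    batches
}
```

Reading a whole row group at once avoids all of this per-column state, which is why that approach is simple and fast, at the cost of batches as large as the row group itself.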


