tustvold commented on issue #1532: URL: https://github.com/apache/arrow-datafusion/issues/1532#issuecomment-1012985001
> That's why I think it would be good if we can come up with a way to avoid cherry-picking commits from arrow2 into arrow-rs

Sorry, I meant more cherry-picking ideas, not actual implementation. As in, you might copy across `arrow2`'s `Buffer` implementation, add a conversion to `arrow-rs`'s `Buffer` implementation, and then migrate the array implementations across one by one. Or do something similar for `MutableBuffer`. Ultimately the in-memory format is the same arrow spec, just getting wrapped up in different ways - the whole point of arrow is that conversion between the two representations should be cheap :smile:. I guess I've just had bad past experiences of simultaneously changing all the things at once :laughing:.

Having looked at the `arrow2` parquet implementation, as parquet is the part of the `arrow-rs` codebase I'm most familiar with, there is a fair amount of non-trivial functionality loss compared to `arrow-rs`. Some of it is esoteric things like nested structures, but there are also larger omissions like certain page encodings or batch size control<sup>1.</sup> (it appears to read entire row groups into a single RecordBatch??).

This is unlikely to be a strictly additive change, and I'm having a very hard time getting my head around all of its implications. That's all I really care about, that we can communicate something more than "everything may or may not be broken" :laughing:

_<sup>1.</sup> FWIW this is the thing that makes reading parquet tricky, as pages don't delimit rows across columns or even semantic records within a column. If you just read row groups, it will be simple and fast, but recommendations are for row groups on the order of 1GB compressed :sweat_smile:_