[GitHub] [arrow-rs] tustvold commented on issue #2394: non-annotated Repeated fields are read incorrectly

GitBox Fri, 12 Aug 2022 04:48:12 -0700


tustvold commented on issue #2394:
URL: https://github.com/apache/arrow-rs/issues/2394#issuecomment-1213028137


   _The following is somewhat subjective, so take it with a grain of salt, but 
I think it is fair_
   
   > column reader
   
   The low-level 
[column](https://docs.rs/parquet/latest/parquet/column/index.html) API is still 
actively developed, insofar as the arrow internals make use of it. However, 
it is worth noting that the arrow readers decode to their own buffer 
implementations instead of using `[DataType::T]`, because decoding to 
`[DataType::T]` is prohibitively expensive, especially for byte arrays. This 
extension mechanism is not currently exposed outside the crate, as it is 
relatively unstable. If you use this interface, you will need to perform 
record reassembly yourself.
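
   To give a rough idea of what record reassembly involves, here is a minimal 
sketch that uses only the standard library. It assumes the simplest possible 
case: a single leaf column declared as `repeated int32` directly under the 
schema root, so the maximum definition and repetition levels are both 1. The 
function name `reassemble` is hypothetical and is not part of the parquet 
crate; nested or optional ancestors would need a more general algorithm.

```rust
/// Rebuild per-record lists from the levels and values of one column.
///
/// Assumption: the column is `repeated int32` directly under the root, so
/// max definition level == 1 and max repetition level == 1.
/// `values` holds only the entries whose definition level is 1.
fn reassemble(def_levels: &[i16], rep_levels: &[i16], values: &[i32]) -> Vec<Vec<i32>> {
    assert_eq!(def_levels.len(), rep_levels.len(), "one def level per rep level");

    let mut records: Vec<Vec<i32>> = Vec::new();
    let mut next_value = values.iter().copied();

    for (&def, &rep) in def_levels.iter().zip(rep_levels) {
        // Repetition level 0 marks the start of a new record.
        if rep == 0 {
            records.push(Vec::new());
        }
        // Definition level 1 means a value is present; level 0 means the
        // repeated field is empty for this record, so no value is consumed.
        if def == 1 {
            let v = next_value
                .next()
                .expect("fewer values than the definition levels imply");
            records
                .last_mut()
                .expect("the first repetition level must be 0")
                .push(v);
        }
    }
    records
}
```

   For example, definition levels `[1, 1, 0, 1]`, repetition levels 
`[0, 1, 0, 0]` and values `[1, 2, 3]` reassemble into three records: 
`[1, 2]`, `[]` and `[3]`.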
   
   > page reader
   
   I presume you're referring to the 
[file](https://docs.rs/parquet/latest/parquet/file/index.html) APIs here. If 
so, these are still actively developed, as they are used by the arrow API 
without any major caveats when operating on local files.
   
   > are there any other
   
   The only high-level interface that I would describe as actively maintained 
is [arrow](https://docs.rs/parquet/latest/parquet/arrow/index.html), which is 
where most development effort is currently focused. Significant effort has 
gone into making it fast and feature complete, and into adding advanced 
functionality such as predicate pushdown and async IO. Whilst arrow may be a 
somewhat heavy dependency, there are ongoing improvements in this space, and I 
believe the additional performance, especially for dictionary encoded or 
variable length types, more than makes up for this.
   
   Perhaps we could add more feature flags to arrow-rs to reduce its size as a 
dependency. Would it then work for your use case?
   


