double-free opened a new issue, #6268:
URL: https://github.com/apache/arrow-rs/issues/6268

   
   # Feature Description
   
   I'm using `parquet_derive` in my project, and I found two inconvenient constraints:
   
   1. `ParquetRecordReader` requires the struct to declare its fields in exactly the **same order** as the columns in the parquet file.
   2. `ParquetRecordReader` requires the struct to read **all fields** in the parquet file.
   
   As described in its documentation:
   
   > Derive flat, simple RecordReader implementations. Works by parsing a struct tagged with #[derive(ParquetRecordReader)] and emitting the correct writing code for each field of the struct. Column readers are generated in the order they are defined.
   
   In my use cases (and I believe these are common requests), users should be able to read a pruned parquet file, and they should be free to reorder the fields in the decoded struct.
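   
   To make the constraints concrete, here is a hypothetical example (the struct, field names, and types are illustrative, not taken from a real schema): for a parquet file with columns `(id, name, score)`, the derive currently only works if the struct lists exactly those fields in exactly that order.
   
   ```rust
   use parquet_derive::ParquetRecordReader;
   
   // Works today: all columns are present, declared in file order.
   #[derive(ParquetRecordReader)]
   struct Row {
       id: i64,
       name: String,
       score: f64,
   }
   
   // Not supported today:
   // - a struct with only `id` and `score` (all columns must be read), or
   // - a struct declared as (name, id, score) (field order must match the file).
   ```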
   
   # My Solution
   
   I introduced a `HashMap` that maps each field name to its column index. This assumes field names are unique, which always holds because the current `parquet_derive` macro is applied to flat structs without nesting.
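   
   A minimal sketch of the idea (the helper and names here are hypothetical, not the actual macro output): the generated reader builds a name-to-index map from the columns present in the row group, resolves each struct field through it, and can report a clear error when a field name is missing.
   
   ```rust
   use std::collections::HashMap;
   
   /// Hypothetical helper: map each column name in the (possibly pruned) file
   /// to its index, so fields can be resolved by name instead of by position.
   fn column_index_by_name(column_names: &[String]) -> HashMap<&str, usize> {
       column_names
           .iter()
           .enumerate()
           .map(|(idx, name)| (name.as_str(), idx))
           .collect()
   }
   
   fn main() {
       // Columns as they appear in the file; the struct may declare them in any order.
       let columns = vec!["name".to_string(), "id".to_string()];
       let index = column_index_by_name(&columns);
   
       assert_eq!(index.get("id"), Some(&1));
       assert_eq!(index.get("name"), Some(&0));
       // A struct field with no matching column becomes an explicit error.
       assert!(index.get("score").is_none());
   }
   ```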
   
   # Pros and Cons
   
   Removing these two constraints clearly makes `parquet_derive` a handier tool.
   
   But it implies some behavioral changes:
   
   - Previously, since `ParquetRecordReader` relied only on field indices, a column named `abc` could be implicitly renamed to `bcd` in the decoded struct. After this change, users must guarantee that every field name in the `ParquetRecordReader` struct exists among the parquet columns.
     - I think it is more intuitive and more natural to constrain the "field name" rather than the "index" when `ParquetRecordReader` is used to derive a decoder.
   - Allowing partial reads of a parquet file may improve performance for some users, but introducing a `HashMap` in the reader may slow it down slightly.
     - When `num_records` in a single parsing call is large enough, the cost of the `HashMap` lookups is negligible (see the sketch after this list).
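   
   A hypothetical sketch of why the lookup cost is amortized (this is not the actual generated code): the name-to-index lookups happen once per `read_from_row_group` call, while the column readers are being set up, so their cost is proportional to the number of struct fields rather than to `num_records`.
   
   ```rust
   use std::collections::HashMap;
   
   // Illustrative shape of the generated reader body.
   fn read_sketch(column_names: &[String], num_records: usize) {
       // Built once per call: O(number of columns).
       let index: HashMap<&str, usize> = column_names
           .iter()
           .enumerate()
           .map(|(i, n)| (n.as_str(), i))
           .collect();
   
       // One lookup per struct field, also once per call.
       let _id_col = index["id"];
       let _score_col = index["score"];
   
       // The per-record loop only touches the already-resolved column readers;
       // no HashMap access happens here.
       for _record in 0..num_records {
           // ... read one value from each resolved column ...
       }
   }
   ```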
   
   Both implied changes seem to have a more positive than negative impact. Please review whether this is a reasonable feature request.
   

