vustef commented on issue #7299:
URL: https://github.com/apache/arrow-rs/issues/7299#issuecomment-3446828518

   I poked around a bit more in the code.
   
   `ArrowReaderOptions::new().with_schema` that @jkylling proposed doesn't seem
to be a good fit, because of this constraint in its documentation:
   ```
   The provided schema must have the same number of columns as the parquet
schema and the column names must be the same.
   ```
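   To make the constraint concrete, here is a minimal plain-Rust model (illustrative only, not the arrow-rs API) of that compatibility check, showing why a schema with an appended virtual column would be rejected; the column names are hypothetical:

   ```rust
   // Model of the `with_schema` compatibility rule quoted above: the supplied
   // schema must match the parquet schema column-for-column, by count and name.
   fn schema_compatible(parquet_cols: &[&str], supplied_cols: &[&str]) -> bool {
       parquet_cols.len() == supplied_cols.len()
           && parquet_cols.iter().zip(supplied_cols).all(|(a, b)| a == b)
   }

   fn main() {
       let parquet = ["id", "name"];
       // Appending a virtual column breaks the same-number-of-columns rule,
       // so `with_schema` cannot be used to add metadata columns.
       let with_virtual = ["id", "name", "_row_index"];
       assert!(schema_compatible(&parquet, &parquet));
       assert!(!schema_compatible(&parquet, &with_virtual));
       println!("ok");
   }
   ```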
   
   So what we can do instead is introduce another method, e.g. the
`with_metadata_columns` proposed above, that appends the metadata (aka virtual)
columns to the end of the schema parsed from the parquet file.
   If it's a method on the builder, the builder has to take the existing schema
(produced during construction of the builder) and derive from it a new
`schema: SchemaRef` and `fields: Option<Arc<ParquetField>>`.
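   A rough sketch of that derivation step (plain Rust, with a stand-in `Field` struct rather than the arrow types; `append_metadata_columns` and `_row_index` are hypothetical names):

   ```rust
   // Illustrative model of how a builder method like `with_metadata_columns`
   // could derive the new schema: the fields parsed from the parquet file come
   // first, and the virtual columns are appended at the end.
   #[derive(Debug, Clone, PartialEq)]
   struct Field {
       name: String,
   }

   fn append_metadata_columns(parquet_fields: &[Field], virtual_names: &[&str]) -> Vec<Field> {
       let mut fields = parquet_fields.to_vec();
       fields.extend(virtual_names.iter().map(|n| Field { name: n.to_string() }));
       fields
   }

   fn main() {
       let parquet = vec![
           Field { name: "id".into() },
           Field { name: "name".into() },
       ];
       let derived = append_metadata_columns(&parquet, &["_row_index"]);
       let names: Vec<_> = derived.iter().map(|f| f.name.as_str()).collect();
       assert_eq!(names, ["id", "name", "_row_index"]);
       println!("ok");
   }
   ```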
   
   The `ProjectionMask` is then applied only to the parquet schema. E.g., to
query only the virtual columns, you'd do
`.with_projection(ProjectionMask::none(num_columns))`. I think this makes more
sense than the alternative, because virtual columns are not part of the
`SchemaDescriptor`, and `ProjectionMask` relies on a `SchemaDescriptor` to be
constructed in some cases. Also, users who want to project the virtual columns
away can simply not add them with `with_metadata_columns` in the first place.
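   The interaction can be modeled like this (again a plain-Rust sketch, not the real `ProjectionMask`): the mask selects among parquet columns only, while any registered virtual columns always pass through, so an all-false mask yields just the virtual columns:

   ```rust
   // Illustrative model of the proposal: the projection mask covers only the
   // parquet columns; virtual columns registered via the hypothetical
   // `with_metadata_columns` are appended unconditionally.
   fn project<'a>(
       parquet_cols: &[&'a str],
       mask: &[bool],            // one entry per parquet column
       virtual_cols: &[&'a str], // never masked; appended at the end
   ) -> Vec<&'a str> {
       parquet_cols
           .iter()
           .zip(mask)
           .filter(|(_, keep)| **keep)
           .map(|(c, _)| *c)
           .chain(virtual_cols.iter().copied())
           .collect()
   }

   fn main() {
       let parquet = ["id", "name"];
       // Analogue of `ProjectionMask::none(num_columns)`: select no parquet columns.
       let none = [false, false];
       assert_eq!(project(&parquet, &none, &["_row_index"]), ["_row_index"]);
       println!("ok");
   }
   ```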
   
   I don't have a good intuition for whether `with_metadata_columns` should live
on `ParquetRecordBatchReaderBuilder` or on `ArrowReaderOptions`. Keeping it on
the builder seems more flexible, since one always has a builder but may not use
`ArrowReaderOptions`.
   
   

