vustef commented on issue #7299: URL: https://github.com/apache/arrow-rs/issues/7299#issuecomment-3446828518
Poked a little bit more at the code. `ArrowReaderOptions::new().with_schema`, which @jkylling proposed, doesn't seem to be a good fit, because of this documented constraint:

```
The provided schema must have the same number of columns as the parquet schema and the column names must be the same.
```

What we can do instead is introduce another method, e.g. the `with_metadata_columns` proposed above, that appends metadata (aka virtual) columns to the end of the schema parsed from the parquet file. If it's a method on the builder, the builder has to take the existing schema (which is produced during construction of the builder) and derive a new `schema: SchemaRef` and `fields: Option<Arc<ParquetField>>` from it.

The `ProjectionMask` is then applied only to the parquet schema. E.g. if you wanted to query only virtual columns, you'd do `.with_projection(ProjectionMask::none(num_columns))`. I think this makes more sense than the alternative, because virtual columns are not part of `SchemaDescriptor`, and `ProjectionMask` relies on `SchemaDescriptor` for its construction in some cases. Also, users simply don't add virtual columns with `with_metadata_columns` in the first place if they want to project them away.

I don't have a good intuition for whether `with_metadata_columns` should live on `ParquetRecordBatchReaderBuilder` or on `ArrowReaderOptions`. Keeping it on the builder seems more flexible, since one always has a builder but may not use `ArrowReaderOptions`.
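To make the intended projection behavior concrete, here is a minimal sketch against the current `parquet` arrow reader API. The `with_metadata_columns` call (and the `MetadataColumn` type it takes) is hypothetical, it's the proposal under discussion, not an existing method, so it's left commented out; `ProjectionMask::leaves` with an empty index list stands in for the `ProjectionMask::none(num_columns)` spelling above:

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Project away every physical parquet column. An empty leaf list is
    // equivalent to the `ProjectionMask::none(num_columns)` spelling above.
    let mask = ProjectionMask::leaves(builder.parquet_schema(), vec![]);

    let reader = builder
        .with_projection(mask)
        // Hypothetical (proposed, does not exist yet): append virtual
        // columns after the columns derived from the parquet schema.
        // .with_metadata_columns(vec![MetadataColumn::RowIndex])
        .build()?;

    for batch in reader {
        let batch = batch?;
        // Under the proposal, `batch` would contain only the virtual
        // columns, since the projection selected no physical columns.
        println!("{} rows, {} columns", batch.num_rows(), batch.num_columns());
    }
    Ok(())
}
```

Note that the mask is built from `builder.parquet_schema()` alone, which is the point: the virtual columns never appear in the `SchemaDescriptor`, so the projection and the metadata columns stay orthogonal.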
