lidavidm opened a new pull request #11704: URL: https://github.com/apache/arrow/pull/11704
This implements the following:

- Being able to project and filter on nested fields in the scanner/query engine. Parquet, ORC, and Feather are supported. CSV does not support reading any nested types. For Parquet, we will materialize only the leaf nodes necessary for the projection. For ORC and Feather, we will read the entire top-level column.

The following are not implemented:

- Normally, the scanner can fill in a column of nulls if a requested column does not exist in a file. This is not supported for nested field refs because it requires ARROW-1888 to be implemented first.
- A nested field ref cannot be used as a key/target of an aggregation or join. Their respective nodes currently compute a FieldPath to resolve a FieldRef, but then throw away the path, keeping only the first index. To implement this, we would need to store the full FieldPath and use the struct_field kernel to resolve the actual array. However, this will add overhead, and we should be careful about regressions here, especially in the common case of no nested field refs.
- Only FieldRefs consisting of field names are supported. For FieldPath (= a sequence of indices), the semantics are unclear. So far, the scanner is robust to individual files having fields in a different order than the overall dataset, but this won't work for FieldPath. So either we must require that the schema is consistent across files, or we must come up with some way to map file schemas onto the dataset schema so that indices have a consistent meaning.
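To illustrate the last point, here is a minimal, library-free Python sketch of why name-based refs survive field reordering while index paths do not. It does not use the Arrow API: the schema representation and the helpers `resolve_names` and `field_at` are hypothetical, invented only for this example, and the two schemas are made-up data.

```python
from typing import List, Sequence, Tuple, Union

# Hypothetical schema model (not Arrow's): a schema is a list of (name, type)
# pairs. A struct type is itself a nested list of fields; a leaf type is a string.
Field = Tuple[str, Union[str, list]]
Schema = List[Field]


def resolve_names(schema: Schema, names: Sequence[str]) -> List[int]:
    """Resolve a name-based reference to an index path within one schema."""
    path: List[int] = []
    fields = schema
    for depth, name in enumerate(names):
        matches = [i for i, (field_name, _) in enumerate(fields) if field_name == name]
        if len(matches) != 1:
            raise KeyError(f"no unique match for {name!r} in {'.'.join(names)}")
        index = matches[0]
        path.append(index)
        child_type = fields[index][1]
        if depth < len(names) - 1:
            # More names remain, so this field must be a struct to descend into.
            if not isinstance(child_type, list):
                raise KeyError(f"{name!r} is not a struct in {'.'.join(names)}")
            fields = child_type
    return path


def field_at(schema: Schema, path: Sequence[int]) -> str:
    """Return the dotted name of the field that an index path points to."""
    names: List[str] = []
    fields = schema
    for index in path:
        name, child_type = fields[index]
        names.append(name)
        fields = child_type if isinstance(child_type, list) else []
    return ".".join(names)


# Made-up dataset schema, and a file whose struct children are in another order.
dataset_schema: Schema = [("id", "int64"), ("point", [("x", "float64"), ("y", "float64")])]
file_schema: Schema = [("id", "int64"), ("point", [("y", "float64"), ("x", "float64")])]

# The same names resolve to a different index path in each schema.
dataset_path = resolve_names(dataset_schema, ["point", "x"])
file_path = resolve_names(file_schema, ["point", "x"])

# Reusing the dataset's index path on the file silently selects another field.
wrong_field = field_at(file_schema, dataset_path)

print(dataset_path, file_path, wrong_field)
```

Resolving by name once per file gives each file its own correct path. Applying the dataset-level index path directly to the file picks `point.y` instead of `point.x`, which is the ambiguity described above.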
