lidavidm opened a new pull request #11704:
URL: https://github.com/apache/arrow/pull/11704


   This implements projection and filtering on nested fields in the 
scanner/query engine.
   
   Parquet, ORC, and Feather are supported. CSV does not support reading any 
nested types. For Parquet, we will materialize only the leaf nodes necessary 
for the projection. For ORC and Feather, we will read the entire top-level 
column.
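
   As a rough illustration of the difference in what gets read, here is a 
minimal pure-Python sketch. It does not use the Arrow API: the nested-dict 
schema, the tuple-of-names refs, and the helpers `leaves_under`, 
`parquet_leaves`, and `top_level_columns` are all hypothetical names invented 
for this example.

```python
# Hypothetical model: a schema is a nested dict; a leaf is any non-dict value.
schema = {
    "id": "int64",
    "user": {
        "name": "string",
        "address": {"city": "string", "zip": "string"},
    },
}


def leaves_under(node, prefix):
    """Yield the dotted path of every leaf at or below `node`."""
    if isinstance(node, dict):
        for name, child in node.items():
            yield from leaves_under(child, prefix + (name,))
    else:
        yield ".".join(prefix)


def parquet_leaves(schema, refs):
    """Leaf columns needed for name-based nested refs (Parquet-style)."""
    needed = []
    for ref in refs:
        node = schema
        for name in ref:
            node = node[name]  # raises KeyError if the ref does not resolve
        for leaf in leaves_under(node, tuple(ref)):
            if leaf not in needed:
                needed.append(leaf)
    return needed


def top_level_columns(refs):
    """Whole top-level columns read for the same refs (ORC/Feather-style)."""
    columns = []
    for ref in refs:
        if ref[0] not in columns:
            columns.append(ref[0])
    return columns


refs = [("user", "address", "city"), ("id",)]
print(parquet_leaves(schema, refs))  # only the leaves under each ref
print(top_level_columns(refs))       # every child of "user" comes along
```

   In this model, projecting `user.address.city` touches one leaf in the 
Parquet-style case, while the ORC/Feather-style case reads all of `user`.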
   
   The following are not implemented:
   - Normally, the scanner can fill in a column of nulls if a requested column 
does not exist in a file. This is not supported for nested field refs because 
we need ARROW-1888 to be implemented.
   - A nested field ref cannot be used as a key/target of an aggregation or 
join. Their respective nodes currently compute a FieldPath to resolve a 
FieldRef, but then throw away the path, keeping only the first index. To 
implement this, we would need to store the full FieldPath and use the 
struct_field kernel to resolve the actual array. However, this would add 
overhead, and we should be careful about regressions here, especially in the 
common case of no nested field refs.
   - Only FieldRefs consisting of field names are supported. For FieldPath (= a 
sequence of indices), the semantics are unclear. So far, the scanner is robust 
to individual files having fields in a different order than the overall 
dataset, but this won't work for FieldPath, because the same indices can point 
at different fields in different files. Either we must require that the schema 
is consistent across files, or we must come up with some way to map file 
schemas onto the dataset schema so that indices have a consistent meaning.
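
   The last two limitations can be illustrated with a small pure-Python 
sketch. It is not Arrow code: the list-of-pairs schemas, the tuple rows, and 
the helpers `find_path` and `get` are hypothetical stand-ins for resolving a 
name-based ref to an index path and then walking that path.

```python
def find_path(fields, names):
    """Resolve a name-based ref to a tuple of indices in one file's schema."""
    path = []
    for name in names:
        index = [field_name for field_name, _ in fields].index(name)
        path.append(index)
        fields = fields[index][1]  # descend into the child field list
    return tuple(path)


def get(row, path):
    """Walk an index path through a nested tuple of values."""
    for index in path:
        row = row[index]
    return row


# Two files with the same fields in a different order (hypothetical data).
file_a = [("id", "int64"), ("user", [("name", "string"), ("city", "string")])]
file_b = [("user", [("city", "string"), ("name", "string")]), ("id", "int64")]

row_a = (1, ("Ada", "London"))
row_b = (("Paris", "Bob"), 2)

ref = ("user", "city")
path_a = find_path(file_a, ref)
path_b = find_path(file_b, ref)

# Resolving by name per file finds the right value in both files,
# even though the resulting index paths differ.
print(path_a, get(row_a, path_a))
print(path_b, get(row_b, path_b))

# Keeping only the first index yields the whole top-level struct,
# not the nested field that was asked for.
print(get(row_a, path_a[:1]))
```

   In this model a name-based ref resolves to a different index path in each 
file, so a single index path has no file-independent meaning. Truncating the 
path to its first index returns the enclosing struct rather than the nested 
value, which is why the full path would need to be kept.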

