nealrichardson commented on PR #33770:
URL: https://github.com/apache/arrow/pull/33770#issuecomment-1400625313

   > @nealrichardson Ok, I did some investigation.
   > 
   > First, the reason this is not being encountered from pyarrow:
   > 
   > The scanner options currently take both a projected schema and a 
projection expression. R sets the projection expression (and so the C++ needs 
to figure out the projected schema) and python sets the projected schema (and 
C++ needs to figure out the projection expression). So pyarrow never encounters 
the code you are modifying (to the best of my knowledge).
   
   What's odd is that the projection provided to the ScanNode isn't what 
the ScanNode returns. The function I changed here returns a schema, but it is 
not the schema that would result from the projection--it's the schema of the 
fields referenced in the projection expression. (The scanner also adds the 
augmented fields, so the schema of the data that comes out of the ScanNode is 
also different from that.) So I don't know why you need the projection 
expression at all, unless it is aspirational/future-looking for some time when 
the projection can be pushed down and handled by the file readers or whatever. 
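
To make the mismatch concrete, here is a rough toy model in plain Python. Dicts stand in for schemas, and every name in it (`dataset_schema`, `referenced_schema`, `projected_schema`) is made up for illustration; none of this is the Arrow API. It only shows that "the schema of the fields referenced by the projection" and "the schema the projection would produce" are two different things:

```python
# Toy model only: plain dicts stand in for Arrow schemas, and none of these
# names are Arrow APIs.
dataset_schema = {"a": "int64", "b": "int64", "c": "string"}

# A projection maps each output column to (output type, referenced input fields).
# This one is roughly: SELECT a + b AS total
projection = {"total": ("int64", ["a", "b"])}


def referenced_schema(schema, projection):
    """Schema of the input fields the projection touches, in dataset order."""
    used = {name for _, refs in projection.values() for name in refs}
    return {name: typ for name, typ in schema.items() if name in used}


def projected_schema(projection):
    """Schema of the columns the projection would actually produce."""
    return {name: typ for name, (typ, _) in projection.items()}


# What the function in this PR computes (fields referenced by the expression).
materialized = referenced_schema(dataset_schema, projection)

# What you would get if the projection were actually applied.
result = projected_schema(projection)
```

In this sketch `materialized` has the columns `a` and `b`, while `result` has the single column `total`, so something downstream of the scan still has to evaluate the expression.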
   
   > 
   > Second, the concern about loading the entire top-level field:
   > 
   > It turns out that partial column loading was [never fully implemented 
anyways](https://github.com/apache/arrow/blob/apache-arrow-11.0.0/cpp/src/arrow/dataset/file_parquet.cc#L240-L247).
 So even though we go through all the trouble of figuring out exactly which 
child to load, we still just load the entire top-level field.
   > 
   > That being said, if R is working as you expect, then I approve this 
approach.
   
   We can ship this but I wonder if it wouldn't be better just to remove this 
projection interface and only accept a schema, which may filter out top-level 
fields only, no other projection. 
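
As a sketch of what I mean by a schema-only interface, again as a toy in plain Python with hypothetical names (`select_top_level` is not an existing Arrow function), the scan would take nothing more than a list of top-level field names and return those fields whole:

```python
def select_top_level(schema, names):
    """Return the sub-schema containing only the requested top-level fields.

    `schema` is a dict of field name -> type description (a stand-in for a
    real schema). Nested fields are kept whole; there is no way to ask for a
    child of a struct or to compute a new column.
    """
    missing = [name for name in names if name not in schema]
    if missing:
        raise KeyError(f"unknown top-level fields: {missing}")
    wanted = set(names)
    # Preserve the dataset's field order rather than the caller's order.
    return {name: typ for name, typ in schema.items() if name in wanted}


dataset_schema = {
    "id": "int64",
    "point": "struct<x: double, y: double>",
    "label": "string",
}

scan_schema = select_top_level(dataset_schema, ["point", "id"])
```

Here `scan_schema` keeps `id` and the entire `point` struct, which matches what the Parquet reader ends up loading today anyway.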
   
   

