nealrichardson commented on PR #33770: URL: https://github.com/apache/arrow/pull/33770#issuecomment-1400625313
> @nealrichardson Ok, I did some investigation.
>
> First, the reason this is not being encountered from pyarrow:
>
> The scanner options currently takes both a projected schema and a projection expression. R sets the projection expression (and so the C++ needs to figure out the projected schema) and python sets the projected schema (and C++ needs to figure out the projection expression). So pyarrow never encounters the code you are modifying (to the best of my knowledge).

What's odd is that the projection provided to the ScanNode isn't what the ScanNode returns. The function I changed here returns a schema, but it is not the schema that would result from the projection; it's the schema of the fields referenced in the projection expression. (The scanner also adds the augmented fields, so the schema of the data that comes out of the ScanNode is different from that as well.) So I don't know why you need the projection expression at all, unless it is aspirational, looking ahead to a time when the projection can be pushed down and handled by the file readers or similar.

> Second, the concern about loading the entire top-level field:
>
> It turns out that partial column loading was [never fully implemented anyways](https://github.com/apache/arrow/blob/apache-arrow-11.0.0/cpp/src/arrow/dataset/file_parquet.cc#L240-L247). So even though we go through all the trouble of figuring out exactly which child to load, we still just load the entire top-level field.
>
> That being said, if R is working as you expect, then I approve this approach.

We can ship this, but I wonder whether it would be better to remove this projection interface and accept only a schema, which could filter out top-level fields but do no other projection.
