nealrichardson commented on PR #33770: URL: https://github.com/apache/arrow/pull/33770#issuecomment-1398606967
> Do you want some unit tests? Of course, this needs some. The tests that were added for this function when it was introduced (https://github.com/apache/arrow/commit/8972ebd81202cdfb2e59c77eb0475120525e6230#diff-273815dd1d6770a7ef790980d9039adddf0ef8efa88c0745234906ab16ac09dfR2784-R2838) are more e2e than unit tests of the function, but I could try to tack on something there with a struct column and a nested ref. Happy to take a different approach if you have suggestions though. I haven't run C++ unit tests in forever, so figured I'd get some feedback before diving in there. > > This seems like it will load the entire column into memory (since you're using the top-level name). True, and this is how it currently works from R anyway. The effect of this change is to not fail when other languages/libraries try to scan with nested refs. > For formats that support nested load (e.g. parquet) I thought we had a better implementation that already worked. For example: > > https://github.com/apache/arrow/blob/9a1373452ff5b4cf41cc371e0585d8dda91ffd36/cpp/src/arrow/dataset/file_parquet.cc#L252 > > seems to be expecting nested refs in the projection. But I could be misunderstanding. The expression-based projection of the old scanner always confused me a little. @jorisvandenbossche mentioned this in my previous PR, and that's why I wanted to send nested refs instead of top-level columns. So why aren't I hitting that code? I'm creating a ScanNode for an ExecPlan. Or am I only hitting this code because I'm testing with an InMemoryDataset? Clearly I'm not hitting file_parquet.cc with that, but does a parquet FileSystemDataset avoid this scanner code entirely? I'm skeptical about that (though it should be easily verified) since this function was added to fix an issue with unbound schemas that IIRC people were experiencing with actual parquet datasets. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org