nealrichardson commented on PR #33770:
URL: https://github.com/apache/arrow/pull/33770#issuecomment-1398606967

   > Do you want some unit tests?
   
   Of course, this needs some. The tests that were added for this function when 
it was introduced 
(https://github.com/apache/arrow/commit/8972ebd81202cdfb2e59c77eb0475120525e6230#diff-273815dd1d6770a7ef790980d9039adddf0ef8efa88c0745234906ab16ac09dfR2784-R2838)
 are more e2e than unit tests of the function, but I could try to tack on 
something there with a struct column and a nested ref. Happy to take a 
different approach if you have suggestions though. I haven't run C++ unit tests 
in forever, so figured I'd get some feedback before diving in there. 
   
   > 
   > This seems like it will load the entire column into memory (since you're 
using the top-level name). 
   
   True, and this is how it currently works from R anyway. The effect of this 
change is to not fail when other languages/libraries try to scan with nested 
refs.
   
   > For formats that support nested load (e.g. parquet) I thought we had a 
better implementation that already worked. For example:
   > 
   > 
https://github.com/apache/arrow/blob/9a1373452ff5b4cf41cc371e0585d8dda91ffd36/cpp/src/arrow/dataset/file_parquet.cc#L252
   > 
   > seems to be expecting nested refs in the projection. But I could be 
misunderstanding. The expression-based projection of the old scanner always 
confused me a little.
   
   @jorisvandenbossche mentioned this in my previous PR, and that's why I 
wanted to send nested refs instead of top-level columns. So why aren't I 
hitting that code? I'm creating a ScanNode for an ExecPlan. Or am I only 
hitting this code because I'm testing with an InMemoryDataset? Clearly I'm not 
hitting file_parquet.cc with that, but does a parquet FileSystemDataset avoid 
this scanner code entirely? I'm skeptical about that (though it should be 
easily verified) since this function was added to fix an issue with unbound 
schemas that IIRC people were experiencing with actual parquet datasets.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to