[GitHub] [arrow] westonpace commented on pull request #33770: GH-33760: [R][C++] Handle nested field refs in scanner

2023-01-23 Thread via GitHub


westonpace commented on PR #33770:
URL: https://github.com/apache/arrow/pull/33770#issuecomment-1400855903

   I think we should merge as-is since we don't want to let pending work block 
current capabilities.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] westonpace commented on pull request #33770: GH-33760: [R][C++] Handle nested field refs in scanner

2023-01-23 Thread via GitHub


westonpace commented on PR #33770:
URL: https://github.com/apache/arrow/pull/33770#issuecomment-1400796896

   > Or should I be using ScanV2Options?
   
   Yes, but it's not ready yet.  Sorry if that was misleading.  By "new" I 
meant "the API we are working towards".


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] westonpace commented on pull request #33770: GH-33760: [R][C++] Handle nested field refs in scanner

2023-01-23 Thread via GitHub


westonpace commented on PR #33770:
URL: https://github.com/apache/arrow/pull/33770#issuecomment-1400632307

   > So I don't know why you need the projection expression at all, unless it 
is aspirational/future-looking for some time when the projection can be pushed 
down and handled by the file readers or whatever.
   
   > We can ship this but I wonder if it wouldn't be better just to remove this 
projection interface and only accept a schema, which may filter out top-level 
fields only, no other projection.
   
   It's less aspirational and more historical.  This comes from the 
pre-ExecPlan days when the scanner was "scan->filter->project".  The new scan 
node API simply accepts a vector of field refs.  However, it should support 
those field refs being nested refs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] westonpace commented on pull request #33770: GH-33760: [R][C++] Handle nested field refs in scanner

2023-01-21 Thread via GitHub


westonpace commented on PR #33770:
URL: https://github.com/apache/arrow/pull/33770#issuecomment-1399270231

   @nealrichardson Ok, I did some investigation.
   
   First, the reason this is not being encountered from pyarrow:
   
   The scanner options currently takes both a projected schema and a projection 
expression.  R sets the projection expression (and so the C++ needs to figure 
out the projected schema) and python sets the projected schema (and C++ needs 
to figure out the projection expression).  So pyarrow never encounters the code 
you are modifying (to the best of my knowledge).
   
   Second, the concern about loading the entire top-level field:
   
   It turns out that partial column loading was [never fully implemented 
anyways](https://github.com/apache/arrow/blob/apache-arrow-11.0.0/cpp/src/arrow/dataset/file_parquet.cc#L240-L247).
  So even though we go through all the trouble of figuring out exactly which 
child to load, we still just load the entire top-level field.
   
   That being said, if R is working as you expect, then I approve this approach.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] westonpace commented on pull request #33770: GH-33760: [R][C++] Handle nested field refs in scanner

2023-01-20 Thread via GitHub


westonpace commented on PR #33770:
URL: https://github.com/apache/arrow/pull/33770#issuecomment-1399077722

   > I thought this was just going to be deleting code from the R package: 
instead of finding the top-level field names in the projection and sending them 
in the ScanNode, I'd send the projection and the scanner would pull out the 
fields.
   
   Got it, so this is not a behavior change, but moving the logic from R into 
C++ where it would appropriately belong?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] westonpace commented on pull request #33770: GH-33760: [R][C++] Handle nested field refs in scanner

2023-01-20 Thread GitBox


westonpace commented on PR #33770:
URL: https://github.com/apache/arrow/pull/33770#issuecomment-1398740288

   > I haven't run C++ unit tests in forever, so figured I'd get some feedback 
before diving in there.
   
   Sorry, I was thinking of R e2e tests.  I would hope the C++ change is 
covered by existing tests.  Although I think we've found in the past that it is 
easy to accidentally load too much from the disk and still pass the tests.
   
   > @jorisvandenbossche mentioned this in my previous PR, and that's why I 
wanted to send nested refs instead of top-level columns. So why aren't I 
hitting that code?
   
   I don't know sadly.  I will try and investigate later today.  I could tell 
you how it works in the new scan node :laughing: but I don't think that will be 
too useful to you yet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org