[GitHub] [arrow] westonpace commented on pull request #34576: GH-34519: [C++][R] Fix dataset scans that project the same name as a field

via GitHub Wed, 15 Mar 2023 12:53:19 -0700


westonpace commented on PR #34576:
URL: https://github.com/apache/arrow/pull/34576#issuecomment-1470746859


   Ok, I put in some debugging and ran your test case.  The way R is building 
scan options the value of `projection` is not "the output of the scan" but "the 
output of the scan + project".  So, in other words, given `SELECT x, y, x+y` 
the built plan would be...
   
   ```
   Scan {
     projection=make_struct(field_ref(x), field_ref(y), add(field_ref(y), 
field_ref(x))
   }
   Project {
     exprs=[
       field_ref(0),
       field_ref(1),
       add(field_ref(1), field_ref(0))
     ]
   }
   ```
   
   Under the old model we would skip `add(...)` in `ScanOptions::projection` 
because it was not a field ref.  This would have been fine for the above query. 
 However, given the query `SELECT cast(x, string)` then R would give us:
   
   ```
   Scan {
     projection=make_struct(cast(field_ref(x), string))
   }
   Project {
     exprs=[
       cast(field_ref(0), string)
     ]
   }
   ```
   
   However, we would have not realized you wanted to read in any fields at all 
because we skip `cast(...)`.  What I had expected (back when I updated the 
projection handling code) was something like...
   
   ```
   Scan {
     projection=make_struct(field_ref(x))
   }
   Project {
     exprs=[
       cast(field_ref(0), string)
     ]
   }
   ```
   
   However, that puts the burden of calculating the materialized fields 
squarely on R's shoulders.
   
   Also, this is more or less the direction the scan node itself is going.  In 
ScanV2 the "projection" will just be a list of "columns to load".
   
   Long term, this seems inevitable.  For example, I think this approach may 
fail with a query like `SELECT left.a from left INNER JOIN right on left.id = 
right.id WHERE left.b > right.b`.  The fact that you have to read `left.b` is 
not immediately evident until after the join.  So I don't think R is going to 
give us `right.b` in the `projection` for the scan options.
   
   The solution (I think) is a proper "pushdown projection" pass.  First, R 
creates a plan with a scan node that is configured to load everything.  Then, 
in a push down projection pass, the actual scan projection is calculated.  
Ideally this would be in C++ but it kind of breaks the old philosophy of 
"zero-optimization"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] westonpace commented on pull request #34576: GH-34519: [C++][R] Fix dataset scans that project the same name as a field

Reply via email to