[GitHub] [arrow] westonpace commented on pull request #33770: GH-33760: [R][C++] Handle nested field refs in scanner
westonpace commented on PR #33770: URL: https://github.com/apache/arrow/pull/33770#issuecomment-1400855903 I think we should merge as-is since we don't want to let pending work block current capabilities. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on pull request #33770: GH-33760: [R][C++] Handle nested field refs in scanner
westonpace commented on PR #33770: URL: https://github.com/apache/arrow/pull/33770#issuecomment-1400796896 > Or should I be using ScanV2Options? Yes, but it's not ready yet. Sorry if that was misleading. By "new" I meant "the API we are working towards". -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on pull request #33770: GH-33760: [R][C++] Handle nested field refs in scanner
westonpace commented on PR #33770: URL: https://github.com/apache/arrow/pull/33770#issuecomment-1400632307 > So I don't know why you need the projection expression at all, unless it is aspirational/future-looking for some time when the projection can be pushed down and handled by the file readers or whatever. > We can ship this but I wonder if it wouldn't be better just to remove this projection interface and only accept a schema, which may filter out top-level fields only, no other projection. It's less aspirational and more historical. This comes from the pre-ExecPlan days when the scanner was "scan->filter->project". The new scan node API simply accepts a vector of field refs. However, it should support those field refs being nested refs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on pull request #33770: GH-33760: [R][C++] Handle nested field refs in scanner
westonpace commented on PR #33770: URL: https://github.com/apache/arrow/pull/33770#issuecomment-1399270231 @nealrichardson Ok, I did some investigation. First, the reason this is not being encountered from pyarrow: The scanner options currently takes both a projected schema and a projection expression. R sets the projection expression (and so the C++ needs to figure out the projected schema) and python sets the projected schema (and C++ needs to figure out the projection expression). So pyarrow never encounters the code you are modifying (to the best of my knowledge). Second, the concern about loading the entire top-level field: It turns out that partial column loading was [never fully implemented anyways](https://github.com/apache/arrow/blob/apache-arrow-11.0.0/cpp/src/arrow/dataset/file_parquet.cc#L240-L247). So even though we go through all the trouble of figuring out exactly which child to load, we still just load the entire top-level field. That being said, if R is working as you expect, then I approve this approach. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on pull request #33770: GH-33760: [R][C++] Handle nested field refs in scanner
westonpace commented on PR #33770: URL: https://github.com/apache/arrow/pull/33770#issuecomment-1399077722 > I thought this was just going to be deleting code from the R package: instead of finding the top-level field names in the projection and sending them in the ScanNode, I'd send the projection and the scanner would pull out the fields. Got it, so this is not a behavior change, but moving the logic from R into C++ where it would appropriately belong? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on pull request #33770: GH-33760: [R][C++] Handle nested field refs in scanner
westonpace commented on PR #33770: URL: https://github.com/apache/arrow/pull/33770#issuecomment-1398740288 > I haven't run C++ unit tests in forever, so figured I'd get some feedback before diving in there. Sorry, I was thinking of R e2e tests. I would hope the C++ change is covered by existing tests. Although I think we've found in the past that it is easy to accidentally load too much from the disk and still pass the tests. > @jorisvandenbossche mentioned this in my previous PR, and that's why I wanted to send nested refs instead of top-level columns. So why aren't I hitting that code? I don't know sadly. I will try and investigate later today. I could tell you how it works in the new scan node :laughing: but I don't think that will be too useful to you yet. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org