nealrichardson commented on issue #34437: URL: https://github.com/apache/arrow/issues/34437#issuecomment-1478746036
@westonpace I started wiring these up and have run into a couple of issues. If these are user error, I'll appreciate the pointers to resolving them. The change to use these nodes is a pure refactor, in that I am expecting all existing tests to pass when using these nodes--and there are a lot of tests. I haven't gotten to everything yet but am hitting a couple of errors in C++: * FetchNode requires order_by first and raises a validation error if ordering is not set, which means a basic `SELECT * LIMIT 10` wouldn't work. I recognize that this operation is non-deterministic, as the validation message states, but it should still be permitted, right? This is causing the tests of `head()` to fail. * This is not a C++ regression I don't think, but for order_by, I get an error in one test that seems to be sorting on a dictionary column. I notice that #29887 is open, and I think the reason the error is happening now has to do with a quirk in how the query is assembled in R--in the current implementation, we replace the sort on the dictionary column with a different sort before the sink node, so the unsupported sort doesn't end up happening. But I wanted to flag that as a nuisance that I'm surprised we haven't had to address yet. ``` ── 1. Error (test-dplyr-summarize.R:617:3): min() and max() on character strings ─────────────────── Error in `compute.arrow_dplyr_query(x)`: Type error: Sorting not supported for type dictionary<values=string, indices=int8, ordered=0> ``` Also, am I correct that `select_k_sink_node` should be dropped and that there is no replacement select_k_node? IIRC it was never actually optimized and was just doing a full sort of the data, so while there would in theory be an efficient way to optimize an "order_by" then "fetch" as TopK, we don't have that implemented in Acero. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
