nealrichardson commented on issue #34437:
URL: https://github.com/apache/arrow/issues/34437#issuecomment-1478746036

   @westonpace I started wiring these up and have run into a couple of issues. 
If these are user error, I'll appreciate the pointers to resolving them. The 
change to use these nodes is a pure refactor, in that I am expecting all 
existing tests to pass when using these nodes--and there are a lot of tests. I 
haven't gotten to everything yet but am hitting a couple of errors in C++:
   
   * FetchNode requires order_by first and raises a validation error if 
ordering is not set, which means a basic `SELECT * LIMIT 10` wouldn't work. I 
recognize that this operation is non-deterministic, as the validation message 
states, but it should still be permitted, right? This is causing the tests of 
`head()` to fail.
   * This is not a C++ regression I don't think, but for order_by, I get an 
error in one test that seems to be sorting on a dictionary column. I notice 
that #29887 is open, and I think the reason the error is happening now has to 
do with a quirk in how the query is assembled in R--in the current 
implementation, we replace the sort on the dictionary column with a different 
sort before the sink node, so the unsupported sort doesn't end up happening. 
But I wanted to flag that as a nuisance that I'm surprised we haven't had to 
address yet.
   
   ```
   ── 1. Error (test-dplyr-summarize.R:617:3): min() and max() on character 
strings ───────────────────
   Error in `compute.arrow_dplyr_query(x)`: Type error: Sorting not supported 
for type dictionary<values=string, indices=int8, ordered=0>
   ```
   
   Also, am I correct that `select_k_sink_node` should be dropped and that 
there is no replacement select_k_node? IIRC it was never actually optimized and 
was just doing a full sort of the data, so while there would in theory be an 
efficient way to optimize an "order_by" then "fetch" as TopK, we don't have 
that implemented in Acero.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to