jhorstmann commented on pull request #1048:
URL: https://github.com/apache/arrow-rs/pull/1048#issuecomment-998310199


   Thanks for the review @alamb, I will go through the individual comments 
later. I was still doing some profiling because I expected a bigger speedup. It 
seems the `DynComparator` and `LexicographicalComparator` take longer than the 
difference between comparing strings and integers. I was also experimenting 
with trying to inline `is_valid` calls on another branch to see if that is the 
bottleneck.
   
   The reason for extending the `SortOptions` is that this functionality helps 
for two separate usecases. The name of the flag is a bit misleading and another 
reason why the PR is still in draft. First usecase is of course for faster 
sorting if the dictionaries are already sorted, for that the `is_ordered` flag 
on `DictionaryArray` would work fine. The other usecase is for partitioning, as 
used for example for window functions. In an expression like `SUM(a) OVER 
(PARTITION BY b ORDER BY c)` Datafusion will sort by `b, c`, with this change 
we could track that `b` is only used for partitioning in the logical plan and 
then set the new flag in `SortOptions`, regardless of whether the dictionary is 
actually sorted.
   
   Now that I'm thinking about, for that usecase we could also pretend that the 
dictionary is sorted by creating a new DictionaryArray, pointing to the same 
data, but with the `is_ordered` flag set, and sort by that array.
   
   Renaming the flag to `sort_for_partitioning` or `sort_by_dictionary_keys` 
could be an option to make the purpose clearer. And the `is_ordered` flag 
should also be taken into account, regardless of the `SortOption` flag.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to