jhorstmann commented on pull request #1048: URL: https://github.com/apache/arrow-rs/pull/1048#issuecomment-998310199
Thanks for the review @alamb, I will go through the individual comments later. I was still doing some profiling because I expected a bigger speedup. It seems the `DynComparator` and `LexicographicalComparator` take longer than the difference between comparing strings and integers. I was also experimenting with trying to inline `is_valid` calls on another branch to see if that is the bottleneck. The reason for extending the `SortOptions` is that this functionality helps for two separate usecases. The name of the flag is a bit misleading and another reason why the PR is still in draft. First usecase is of course for faster sorting if the dictionaries are already sorted, for that the `is_ordered` flag on `DictionaryArray` would work fine. The other usecase is for partitioning, as used for example for window functions. In an expression like `SUM(a) OVER (PARTITION BY b ORDER BY c)` Datafusion will sort by `b, c`, with this change we could track that `b` is only used for partitioning in the logical plan and then set the new flag in `SortOptions`, regardless of whether the dictionary is actually sorted. Now that I'm thinking about, for that usecase we could also pretend that the dictionary is sorted by creating a new DictionaryArray, pointing to the same data, but with the `is_ordered` flag set, and sort by that array. Renaming the flag to `sort_for_partitioning` or `sort_by_dictionary_keys` could be an option to make the purpose clearer. And the `is_ordered` flag should also be taken into account, regardless of the `SortOption` flag. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
