Hi all (my first message in this list), I have written some (Python) code to make lexsort on (lists of arrays of) integers more efficient, by "merging" all levels in a single array. I'm asking for some feedback between doing a PR.
Compared to the current code, there is an improvement as long as: - the arrays (levels) are long at least 10^3 - the improvement becomes significant (~3x) above 10^4 - the sum of bits required for the representation of each level is no larger than the platform uint's bits; otherwise, my code falls back to Python int, and becomes way less performant. My rough guess it that the second condition holds in the vast majority of use cases (including a planned use in scipy - see https://github.com/scipy/scipy/issues/20151 for context). The code is here: https://github.com/toobaz/numpy/commit/aa5c6948 I know that the current form is not well written - at least _int_lexsort seems out of place, and imports placed inside the function - but I didn't know in which file it would belong (and anyway, this version makes a quick glance easy). In any case, feedback is welcome as to what should go where. For the records, the approach is similar to that employed in pandas MultiIndexes: https://github.com/pandas-dev/pandas/pull/19074 API-wise, there are various possibilities to do this: - the current status is that a new parameter "int_path" is added, and this determines which path to take; so nothing changes unless the user wants to use the new path - alternatively, I could check if all arrays are int (O(1)) and their maxima/minima (O(n)) and use the optimized code path only if it relies on uint, not Python int. The downside is that I would (slightly?) slow down those cases which do fallback to the old code. - moreover, in the current version "int_path" can take as value either a bool or a sorting algorithm, as understood by np.argsort. But an alternative is that we add a new argument "kind" to lexsort (consistently with argsort), which at the moment would raise NotImplementedError in the "old" path. By the way, my C is too rusty to really understand what goes on here: https://github.com/numpy/numpy/blob/main/numpy/_core/src/multiarray/item_selection.c#L1796 ... but maybe it wouldn't be so difficult to support switching "kind" also in the "general" lexsort (old code path). - note that lexsort is guaranteed to be stable, while the new code path is stable only if "kind" is set to e.g. 'stable' (the default for argsort being 'quicksort') The current code misses tests (I'll be happy to add some once someone confirms that there is interest for this contribution), but some (performance) tests I ran are here: https://github.com/toobaz/int_lexsort/blob/main/Show%20benchmarks.ipynb Thanks for your attention, Pietro Battiston _______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com