Hi all (my first message in this list),

I have written some (Python) code to make lexsort on (lists of arrays of) 
integers more efficient, by "merging" all levels in a single array. I'm asking 
for some feedback between doing a PR.

Compared to the current code, there is an improvement as long as:
- the arrays (levels) are long at least 10^3 - the improvement becomes 
significant (~3x) above 10^4
- the sum of bits required for the representation of each level is no larger 
than the platform uint's bits; otherwise, my code falls back to Python int, and 
becomes way less performant.

My rough guess it that the second condition holds in the vast majority of use 
cases (including a planned use in scipy - see 
https://github.com/scipy/scipy/issues/20151 for context).

The code is here: https://github.com/toobaz/numpy/commit/aa5c6948

I know that the current form is not well written - at least _int_lexsort seems 
out of place, and imports placed inside the function - but I didn't know in 
which file it would belong (and anyway, this version makes a quick glance 
easy). In any case, feedback is welcome as to what should go where.

For the records, the approach is similar to that employed in pandas 
MultiIndexes: https://github.com/pandas-dev/pandas/pull/19074

API-wise, there are various possibilities to do this:

- the current status is that a new parameter "int_path" is added, and this 
determines which path to take; so nothing changes unless the user wants to use 
the new path

- alternatively, I could check if all arrays are int (O(1)) and their 
maxima/minima (O(n)) and use the optimized code path only if it relies on uint, 
not Python int. The downside is that I would (slightly?) slow down those cases 
which do fallback to the old code.

- moreover, in the current version "int_path" can take as value either a bool 
or a sorting algorithm, as understood by np.argsort. But an alternative is that 
we add a new argument "kind" to lexsort (consistently with argsort), which at 
the moment would raise NotImplementedError in the "old" path. By the way, my C 
is too rusty to really understand what goes on here:
https://github.com/numpy/numpy/blob/main/numpy/_core/src/multiarray/item_selection.c#L1796
... but maybe it wouldn't be so difficult to support switching "kind" also in 
the "general" lexsort (old code path).

- note that lexsort is guaranteed to be stable, while the new code path is 
stable only if "kind" is set to e.g. 'stable' (the default for argsort being 
'quicksort')

The current code misses tests (I'll be happy to add some once someone confirms 
that there is interest for this contribution), but some (performance) tests I 
ran are here:
https://github.com/toobaz/int_lexsort/blob/main/Show%20benchmarks.ipynb

Thanks for your attention,

Pietro Battiston
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com

Reply via email to