On Sat, 9 Apr 2022, Raul Miller wrote:
People will say that certain algorithms, such as hashing, are highly
efficient. But these assertions are quite often not accompanied by
adequate benchmarking on large datasets. And, these approaches often
have inefficient worst case behavior.
Hashing is O(1) (or, if you prefer, O(#y) for ~.y, same as sorting). A
sufficiently smart(tm) hash function will avoid inordinate collision
rates, so I am not sure what worst case behaviour you are referring to.
If you are able to share your 1e8-sized dataset where sorting before
removing duplicates was faster than using ~., that would be great. Maybe
I can make it faster :)
For me, ~. on a vector of 1e8 integers is always faster than /:~ alone,
regardless of the duplication rate, but the specifics depend on the shape
of your data (matrix? boxed? float? note /: is intolerant but i. et al are
not by default).
(I will note that the algorithm I am thinking of would only be useful for
large datasets--it is useless for small ones--and it has a distinct
advantage over sorting in that domain.)
All that aside, though, I think the original question has value even if it
turns out that an in-order ~. is always as fast as an out-of-order one.
_In general_, putting items in a certain order creates information, and it
may be possible to profit by avoiding the need to create that information.
It occurs to me that there is a cop-out method: use !., as in ~.!.1. This
has precedent in i.!.1, but I do not like it. And I think the method can
also be used for -., resulting in ambiguities: should x -.!.1 y mean the
order of the result does not matter, or that the arguments are sorted? (I
know there is currently no -.!.1, but it could plausibly exist.)
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm