On Monday, 2 March 2020 at 21:33:37 UTC, Steven Schveighoffer
wrote:
On 3/2/20 3:52 PM, aliak wrote:
On Monday, 2 March 2020 at 15:47:26 UTC, Steven Schveighoffer
wrote:
On 3/2/20 6:52 AM, Andrea Fontana wrote:
On Saturday, 29 February 2020 at 20:11:24 UTC, Steven
Schveighoffer wrote:
1. in is supposed to be O(lg(n)) or better. Generic code
may depend on this property. Searching an array is O(n).
Probably it should work if we're using a "SortedRange".
int[] a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
auto p = assumeSorted(a);
assert(3 in p);
That could work. Currently, you need to use p.contains(3).
opIn could be added as a shortcut.
It only makes sense if you have it as a literal though, as
p.contains(3) isn't that bad to use:
assert(3 in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].assumeSorted);
There's no guarantee that checking if a value is in a sorted
list is any faster than checking if it's in a non sorted list.
It's why sort usually switches from a binary-esque algorithm
to a linear one at a certain size.
Well of course! A binary search needs Lg(n) comparisons for
pretty much any value, whereas a linear search is going to end
early when it finds it. So there's no guarantee that searching
for an element in the list is going to be faster one way or the
other. But Binary search is going to be faster overall because
the complexity is favorable.
Overall tending towards infinity maybe, but not overall on the
average case it would seem. Branch prediction in CPUs changes
that in that with a binary search it is always a miss. Whereas
with linear it's always a hit.
The list could potentially need to be _very_ large for
p.contains to make a significant impact over canFind(p) AFAIK.
Here's a small test program, try playing with the numbers and
see what happens:
import std.random;
import std.range;
import std.algorithm;
import std.datetime.stopwatch;
import std.stdio;
void main()
{
auto count = 1_000;
auto max = int.max;
alias randoms = generate!(() => uniform(0, max));
auto r1 = randoms.take(count).array;
auto r2 = r1.dup.sort;
auto elem = r1[uniform(0, count)];
auto elem = r1[$-1]; // try this instead
benchmark!(
() => r1.canFind(elem),
() => r2.contains(elem),
)(1_000).writeln;
}
Use LDC and -O3 of course. I was hard pressed to get the
sorted contains to be any faster than canFind.
This begs the question then: do these requirements on in make
any sense? An algorithm can be log n (ala the sorted search)
but still be a magnitude slower than a linear search... what
has the world come to 🤦♂️
PS: Why is it named contains if it's on a SortedRange and
canFind otherwise?
A SortedRange uses O(lgn) steps vs. canFind which uses O(n)
steps.
canFind is supposed to tell the reader that it's O(n) and
contains O(lgn)?
If you change your code to testing 1000 random numbers, instead
of a random number guaranteed to be included, then you will see
a significant improvement with the sorted version. I found it
to be about 10x faster. (most of the time, none of the other
random numbers are included). Even if you randomly select 1000
numbers from the elements, the binary search will be faster. In
my tests, it was about 5x faster.
Hmm... What am I doing wrong with this code? And also how are you
compiling?:
void main()
{
auto count = 1_000_000;
auto max = int.max;
alias randoms = generate!(() => uniform(0, max - 1));
auto r1 = randoms.take(count).array;
auto r2 = r1.dup.sort;
auto r3 = r1.dup.randomShuffle;
auto results = benchmark!(
() => r1.canFind(max),
() => r2.contains(max),
() => r3.canFind(max),
)(5_000);
results.writeln;
}
$ ldc2 -O3 test.d && ./test
[1 hnsec, 84 μs and 7 hnsecs, 0 hnsecs]
Note that the compiler can do a lot more tricks for linear
searches, and CPUs are REALLY good at searching sequential
data. But complexity is still going to win out eventually over
heuristics. Phobos needs to be a general library, not one that
only caters to certain situations.
General would be the most common case. I don't think extremely
large (for some definition of large) lists are the more common
ones. Or maybe they are. But I'd be surprised. I also don't think
phobos is a very data-driven library. But, that's a whole other
conversation :)
-Steve