Re: in not working for arrays is silly, change my view

aliak via Digitalmars-d-learn Mon, 02 Mar 2020 14:26:06 -0800

On Monday, 2 March 2020 at 21:33:37 UTC, Steven Schveighoffer wrote:

On 3/2/20 3:52 PM, aliak wrote:
On Monday, 2 March 2020 at 15:47:26 UTC, Steven Schveighoffer wrote:
On 3/2/20 6:52 AM, Andrea Fontana wrote:
On Saturday, 29 February 2020 at 20:11:24 UTC, Steven Schveighoffer wrote:
1. in is supposed to be O(lg(n)) or better. Generic code may depend on this property. Searching an array is O(n).
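
For illustration, a minimal sketch of the distinction being drawn here: `in` is built in for associative arrays (a hash lookup), while a plain array has no `in` and is searched with the linear-time canFind. The variable names are made up for the example.

```d
import std.algorithm : canFind;

void main()
{
    // Associative array: `in` is built in and yields a pointer to the value,
    // or null when the key is absent.
    int[string] ages = ["alice": 30, "bob": 25];
    assert(("alice" in ages) !is null);
    assert(("carol" in ages) is null);

    // Plain array: no `in`; canFind walks the elements one by one, O(n).
    int[] xs = [0, 1, 2, 3, 4];
    assert(xs.canFind(3));
    assert(!xs.canFind(42));

    // `in` on a plain array is rejected at compile time.
    static assert(!__traits(compiles, 3 in xs));
}
```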
Probably it should work if we're using a "SortedRange".


int[] a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
auto p = assumeSorted(a);

assert(3 in p);
That could work. Currently, you need to use p.contains(3). opIn could be added as a shortcut.
It only makes sense if you have it as a literal though, as p.contains(3) isn't that bad to use:
assert(3 in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].assumeSorted);
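
A rough sketch of how such a shortcut could look without touching Phobos. InSorted and inSorted are hypothetical names invented for this example, not existing library symbols; the wrapper simply forwards `in` to SortedRange's contains, so the lookup stays a binary search.

```d
import std.range : assumeSorted;

// Hypothetical wrapper (NOT part of Phobos): lets `value in wrapper`
// forward to the wrapped SortedRange's contains().
struct InSorted(R)
{
    R sorted; // expected to be a SortedRange

    bool opBinaryRight(string op, T)(T value)
        if (op == "in")
    {
        return sorted.contains(value);
    }
}

// Hypothetical convenience constructor so it can be used in a UFCS chain.
auto inSorted(R)(R sorted)
{
    return InSorted!R(sorted);
}

void main()
{
    auto p = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].assumeSorted.inSorted;
    assert(3 in p);
    assert(!(42 in p));
}
```

Note that this version returns a bool, unlike `in` on an associative array, which returns a pointer to the value.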
There's no guarantee that checking if a value is in a sorted list is any faster than checking if it's in a non-sorted list. It's why sort usually switches from a binary-esque algorithm to a linear one at a certain size.
Well of course! A binary search needs lg(n) comparisons for pretty much any value, whereas a linear search is going to end early when it finds it. So there's no guarantee that searching for an element in the list is going to be faster one way or the other. But binary search is going to be faster overall because the complexity is favorable.

Overall as n tends towards infinity, maybe, but not in the average case, it would seem. Branch prediction in CPUs changes that: with a binary search it is always a miss, whereas with a linear search it's always a hit.

The list could potentially need to be _very_ large for p.contains to make a significant impact over canFind(p) AFAIK.
Here's a small test program, try playing with the numbers and see what happens:
import std.random;
import std.range;
import std.algorithm;
import std.datetime.stopwatch;
import std.stdio;

void main()
{
     auto count = 1_000;
     auto max = int.max;

     alias randoms = generate!(() => uniform(0, max));

     auto r1 = randoms.take(count).array;
     auto r2 = r1.dup.sort;
     auto elem = r1[uniform(0, count)];
     // auto elem = r1[$-1]; // try this instead
     benchmark!(
         () => r1.canFind(elem),
         () => r2.contains(elem),
     )(1_000).writeln;
}
Use LDC and -O3 of course. I was hard pressed to get the sorted contains to be any faster than canFind.
This begs the question then: do these requirements on in make any sense? An algorithm can be log n (à la the sorted search) but still be an order of magnitude slower than a linear search... what has the world come to 🤦‍♂️
PS: Why is it named contains if it's on a SortedRange and canFind otherwise?
A SortedRange uses O(lg n) steps vs. canFind which uses O(n) steps.

canFind is supposed to tell the reader that it's O(n) and contains O(lg n)?

If you change your code to test 1000 random numbers, instead of a random number guaranteed to be included, then you will see a significant improvement with the sorted version. I found it to be about 10x faster (most of the time, none of the other random numbers are included). Even if you randomly select 1000 numbers from the elements, the binary search will be faster. In my tests, it was about 5x faster.
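
A minimal sketch of the kind of test described above, assuming 1000 random needles that are mostly absent from the haystack. The helpers countLinear and countBinary are made-up names for this example, and this is not the code that was actually measured. The hit counts are accumulated and checked so the optimizer cannot simply discard the searches; no timings are claimed here.

```d
import std.algorithm : canFind, sort;
import std.array : array;
import std.datetime.stopwatch : benchmark;
import std.random : uniform;
import std.range : generate, take;
import std.stdio : writeln;

// Hypothetical helper: counts needles found with a linear scan, O(n) each.
size_t countLinear(const(int)[] haystack, const(int)[] needles)
{
    size_t hits;
    foreach (n; needles)
        hits += haystack.canFind(n);
    return hits;
}

// Hypothetical helper: counts needles found with SortedRange.contains,
// a binary search, O(lg n) each.
size_t countBinary(R)(R sorted, const(int)[] needles)
{
    size_t hits;
    foreach (n; needles)
        hits += sorted.contains(n);
    return hits;
}

void main()
{
    enum count = 1_000;
    auto r1 = generate!(() => uniform(0, int.max)).take(count).array;
    auto r2 = r1.dup.sort();

    // Random needles: most of them will not be in the haystack.
    auto needles = generate!(() => uniform(0, int.max)).take(1_000).array;

    size_t linearHits, binaryHits;
    auto results = benchmark!(
        () { linearHits += countLinear(r1, needles); },
        () { binaryHits += countBinary(r2, needles); }
    )(1_000);

    writeln(results);
    // Both strategies must agree on what was found.
    assert(linearHits == binaryHits);
}
```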

Hmm... What am I doing wrong with this code? And also how are you compiling?


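import std.random;
import std.range;
import std.algorithm;
import std.datetime.stopwatch;
import std.stdio;

void main()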
{
    auto count = 1_000_000;
    auto max = int.max;

    alias randoms = generate!(() => uniform(0, max - 1));

    auto r1 = randoms.take(count).array;
    auto r2 = r1.dup.sort;
    auto r3 = r1.dup.randomShuffle;

    auto results = benchmark!(
        () => r1.canFind(max),
        () => r2.contains(max),
        () => r3.canFind(max),
    )(5_000);

    results.writeln;
}


$ ldc2 -O3 test.d && ./test
[1 hnsec, 84 μs and 7 hnsecs, 0 hnsecs]

Note that the compiler can do a lot more tricks for linear searches, and CPUs are REALLY good at searching sequential data. But complexity is still going to win out eventually over heuristics. Phobos needs to be a general library, not one that only caters to certain situations.

General would be the most common case. I don't think extremely large (for some definition of large) lists are the more common ones. Or maybe they are. But I'd be surprised. I also don't think phobos is a very data-driven library. But, that's a whole other conversation :)


-Steve
