Re: Help with an algorithm!

Ivan Kazmenko via Digitalmars-d-learn Thu, 15 Jun 2017 04:51:08 -0700

On Thursday, 15 June 2017 at 06:06:01 UTC, MGW wrote:

There are two arrays of string [] mas1, mas2; Size of eachabout 5M lines. By the size they different, but lines in bothmatch for 95%. It is necessary to find all lines in an array ofmas2 which differ from mas1. The principal criterion - speed.There are the 8th core processor and it is good to involve amultithreading.


The approaches which come to mind are:

1. Sort both arrays, then traverse them in sorted order, like inmerge step of merge sort:


    sort (mas1);
    sort (mas2);

    size_t index1 = 0;
    foreach (str2; mas2)
    {
        while (index1 < mas1.length && mas1[index1] < str2)
            index1 += 1;
        if (mas1[index1] != str2)
            writeln (str2);
    }

Sorting takes O (n log n * c) time, where n is the size of thearrays, and c is the expected time of two strings comparison whensorting. The subsequent step is O (n * c) which is faster thansorting.

2. Hashing. Just put the contents of the first array into a bool[string], and then, for each string from the second array, checkwhether it is contained in the associative array. The time willbe O (total length of all strings) multiplied by a moderateconstant, unless the strings are designed specifically togenerate hash collisions, in which case it will be slower.

3. Trie. Similar to hashing, but the constant multipliers willbe much higher unless the strings have large common prefixes.

Whether we can do faster depends on context. For example, if thestrings tend to all have long common prefixes, any stringcomparison will be slow, but otherwise it can be thought of astaking constant time.


Ivan Kazmenko.

Re: Help with an algorithm!

Reply via email to