On Thursday, 15 June 2017 at 06:06:01 UTC, MGW wrote:
There are two arrays of string [] mas1, mas2; Size of each about 5M lines. By the size they different, but lines in both match for 95%. It is necessary to find all lines in an array of mas2 which differ from mas1. The principal criterion - speed. There are the 8th core processor and it is good to involve a multithreading.

The approaches which come to mind are:

1. Sort both arrays, then traverse them in sorted order, like in merge step of merge sort:

    sort (mas1);
    sort (mas2);

    size_t index1 = 0;
    foreach (str2; mas2)
    {
        while (index1 < mas1.length && mas1[index1] < str2)
            index1 += 1;
        if (mas1[index1] != str2)
            writeln (str2);
    }

Sorting takes O (n log n * c) time, where n is the size of the arrays, and c is the expected time of two strings comparison when sorting. The subsequent step is O (n * c) which is faster than sorting.

2. Hashing. Just put the contents of the first array into a bool [string], and then, for each string from the second array, check whether it is contained in the associative array. The time will be O (total length of all strings) multiplied by a moderate constant, unless the strings are designed specifically to generate hash collisions, in which case it will be slower.

3. Trie. Similar to hashing, but the constant multipliers will be much higher unless the strings have large common prefixes.

Whether we can do faster depends on context. For example, if the strings tend to all have long common prefixes, any string comparison will be slow, but otherwise it can be thought of as taking constant time.

Ivan Kazmenko.

Reply via email to