Greetings Dlang wizards,

I seek knowledge/understanding of a very frustrating phenomenon I've experienced over the past several days.

The problem space:

1) Read a list of strings from a file
2) De-duplicate all strings into the subset of unique strings
3) Sort the subset of unique strings by descending length and then by ascending lexicographic identity 4) Iterate through the sorted subset of unique strings, identifying smaller sequences with perfect identity to their largest possible parent string

I have written a Dlang program that performantly progresses through step #3 above. I used a built-in AA (associative array) to uniquely de-duplicate the initial set of strings and then used multiSort(). Performance was good up till this point, especially with use of the LDC compiler.

Things went sideways at step #4: because multiSort() returns a SortedRange, I used .array to convert the returned SortedRange into an array of type string[]. This appeared to work, and neither DMD nor LDC threw any warnings/errors for doing this.

With the formally returned array, I then attempted to construct a double foreach loop to iterate through the sorted array of unique strings and find substring matches.

foreach( i, ref pStr; sortedArr )
{
    foreach( j, ref cStr; sortedArr[ i + 1 .. $ ] )
    {
        if( indexOf( pStr, cStr ) > -1 )
        {
            // ...
        }
    }
}

Before adding the code excerpt above, the Dlang program was taking ~1 second on an input file containing approx. 64,000 strings.

By adding the code above, the program now takes 6 minutes to complete. An attempt was made to more efficiently perform ASCII-only substring searching by converting the sorted string[] into ubyte[][] and then using countUntil() instead of indexOf(), but this had an effect that was completely opposite to what I had previously experienced: the program then took over 20 minutes to complete!

Thus, I am entirely baffled.

My first attempt to solve this problem space used a small Perl program to perform steps 1 through 3, which would then pipe intermediate output to a small Dlang program handling only step #4 using dynamic arrays (no use of AAs) of ubyte[][] with use of countUntil().

The Dlang code for the nested foreach block above is essentially near-identical between my two Dlang implementations. Yet, the second implementation--where I'm trying to solve the entire problem space in D--has absolutely failed in terms of performance.

Perl+D, ubyte[][], countUntil() :: under 2 seconds
only D, string[], indexOf() :: ~6 minutes
only D, ubyte[][], countUntil() :: >20 minutes

Please advise. This nightmarish experience is shaking my confidence in using D.

Reply via email to