Re: 200-600x slower Dlang performance with nested foreach loop

methonash via Digitalmars-d-learn Sat, 30 Jan 2021 15:15:29 -0800

Greetings all,

Many thanks for sharing your collective perspective and advice thus far! It has been very helpful and instructive. I return bearing live data and a minimally complete, compilable, and executable program to experiment with and potentially optimize. The dataset can be pulled from here:


https://filebin.net/qf2km1ea9qgd5skp/seqs.fasta.gz?t=97kgpebg

Running "cksum" on this file:

1477520542 2199192 seqs.fasta.gz

Naturally, you'll need to gunzip this file. The decompressed file contains strings on every even-numbered line that have already been reduced to the unique de-duplicated subset, and they have already been sorted by descending length and alphabetical identity.

From my initial post, the focus is now entirely on step #4: finding/removing strings that can be entirely absorbed (substringed) by their largest possible parent.


And now for the code:


import std.stdio : writefln, File, stdin;
import std.conv : to;
import std.string : indexOf;

void main()
{
        string[] seqs;

        foreach( line; stdin.byLine() )
        {
                // Skip FASTA header lines; the length check avoids indexing an empty line.
                if( line.length == 0 || line[ 0 ] == '>' ) continue;
                else seqs ~= to!string( line );
        }

        foreach( i; 0 .. seqs.length )
        {
                if( seqs[ i ].length == 0 ) continue;

                foreach( j; i + 1 .. seqs.length )
                {
                        if( seqs[ j ].length == 0 ||
                            seqs[ i ].length == seqs[ j ].length ) continue;

                        if( indexOf( seqs[ i ], seqs[ j ] ) > -1 )
                        {
                                seqs[ j ] = "";
                                writefln( "%s contains %s", i, j );
                        }
                }
        }
}

Compile the source and then run the executable via redirecting stdin:


./substr < seqs.fasta

See any straightforward optimization paths here?

For curiosity, I experimented with use of string[] and ubyte[][] and several functions (indexOf, canFind, countUntil) to assess for the best potential performer. My off-the-cuff results:


string[] with indexOf() :: 26.5-27 minutes
string[] with canFind() :: >28 minutes
ubyte[][] with canFind() :: 27.5 minutes
ubyte[][] with countUntil() :: 27.5 minutes

Resultantly, the code above uses string[] with indexOf(). Tests were performed with an Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz.
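
For reference, here is a minimal sketch of how the three search functions listed above might be called to answer the same "does the parent contain the child?" question. The sample strings are made up (not taken from seqs.fasta), the absorbedBy* helper names are hypothetical, and representation() is only one assumed way of viewing a string as bytes; the ubyte[][] runs timed above may have been set up differently.

```d
import std.algorithm.searching : canFind, countUntil;
import std.stdio : writefln;
import std.string : indexOf, representation;

// Hypothetical helper: string + indexOf(), as in the program above.
// indexOf() returns the offset of the first match, or -1 if absent.
bool absorbedByIndexOf( string parent, string child )
{
        return indexOf( parent, child ) > -1;
}

// Hypothetical helper: the same check on raw bytes with canFind().
// representation() views the string as immutable(ubyte)[] without copying.
bool absorbedByCanFind( string parent, string child )
{
        return canFind( parent.representation, child.representation );
}

// Hypothetical helper: countUntil() also returns an offset, or -1 if absent.
bool absorbedByCountUntil( string parent, string child )
{
        return countUntil( parent.representation, child.representation ) > -1;
}

void main()
{
        // Made-up sample strings, not taken from seqs.fasta.
        string parent = "ACGTACGTTAGC";
        string child = "GTTA";     // occurs inside parent
        string stranger = "TTTT";  // does not occur inside parent

        writefln( "indexOf:    %s %s", absorbedByIndexOf( parent, child ),
                  absorbedByIndexOf( parent, stranger ) );
        writefln( "canFind:    %s %s", absorbedByCanFind( parent, child ),
                  absorbedByCanFind( parent, stranger ) );
        writefln( "countUntil: %s %s", absorbedByCountUntil( parent, child ),
                  absorbedByCountUntil( parent, stranger ) );
}
```

All three helpers should agree on every pair; they differ only in which library routine performs the search.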

I have additional questions/concerns/confusion surrounding the foreach() syntax I have had to apply above, but performance remains my chief immediate concern.
