Greetings all,

Many thanks for sharing your collective perspective and advice thus far! It has been very helpful and instructive. I return bearing live data and a minimally complete, compilable, and executable program to experiment with and potentially optimize. The dataset can be pulled from here:

Running "cksum" on this file:

1477520542 2199192 seqs.fasta.gz

Naturally, you'll need to gunzip this file. The decompressed file contains strings on every even-numbered line that have already been reduced to the unique de-duplicated subset, and they have already been sorted by descending length and alphabetical identity.

From my initial post, the focus is now entirely on step #4: finding/removing strings that can be entirely absorbed (substringed) by their largest possible parent.
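
To make the goal of step #4 concrete, here is a toy illustration on four made-up strings. It is only a sketch: the helper name removeAbsorbed and the input strings are hypothetical, and it assumes the list is already de-duplicated and sorted by descending length, as the real dataset is.

```d
import std.stdio : writeln;
import std.string : indexOf;

// Hypothetical helper: drops every string that occurs inside a longer
// string appearing earlier in the list. Assumes `seqs` is de-duplicated
// and sorted by descending length.
string[] removeAbsorbed( string[] seqs )
{
        foreach( i; 0 .. seqs.length )
        {
                if( seqs[ i ].length == 0 ) continue;

                foreach( j; i + 1 .. seqs.length )
                {
                        if( seqs[ j ].length == 0 ) continue;

                        // blank out the child so it is never searched again
                        if( indexOf( seqs[ i ], seqs[ j ] ) > -1 ) seqs[ j ] = "";
                }
        }

        // collect the survivors
        string[] kept;
        foreach( s; seqs )
                if( s.length > 0 ) kept ~= s;

        return kept;
}

void main()
{
        // "CGTA" and "ACG" both occur inside "ACGTAC"; "TTGCA" does not.
        writeln( removeAbsorbed( [ "ACGTAC", "TTGCA", "CGTA", "ACG" ] ) );
}
```

Only "ACGTAC" and "TTGCA" should survive here, since the two shorter strings are absorbed by the first one.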

And now for the code:

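import std.stdio : writefln, File, stdin;
import std.conv : to;
import std.string : indexOf;

void main()
{
        string[] seqs;

        // read the sequences, skipping blank lines and FASTA header lines
        foreach( line; stdin.byLine() )
        {
                if( line.length == 0 || line[ 0 ] == '>' ) continue;
                else seqs ~= to!string( line );
        }

        // seqs is sorted by descending length, so a parent always precedes its children
        foreach( i; 0 .. seqs.length )
        {
                if( seqs[ i ].length == 0 ) continue;

                foreach( j; i + 1 .. seqs.length )
                {
                        // skip strings already absorbed, and equal-length strings
                        // (the input is de-duplicated, so they cannot match)
                        if( seqs[ j ].length == 0 ||
                            seqs[ i ].length == seqs[ j ].length ) continue;

                        if( indexOf( seqs[ i ], seqs[ j ] ) > -1 )
                        {
                                seqs[ j ] = "";

                                writefln( "%s contains %s", i, j );
                        }
                }
        }
}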

Compile the source and then run the executable, redirecting the decompressed file to stdin:

./substr < seqs.fasta

See any straightforward optimization paths here?

For curiosity, I experimented with use of string[] and ubyte[][] and several functions (indexOf, canFind, countUntil) to assess for the best potential performer. My off-the-cuff results:

string[] with indexOf() :: 26.5-27 minutes
string[] with canFind() :: >28 minutes
ubyte[][] with canFind() :: 27.5 minutes
ubyte[][] with countUntil() :: 27.5 minutes

Accordingly, the code above uses string[] with indexOf(), which was marginally the fastest of the four. Tests were performed with an Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz.
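
For anyone wanting to reproduce the comparison, the containment tests listed above might be sketched roughly as follows. This is not the exact code that was timed: the helper names are hypothetical, and using representation() to view a string as bytes is an assumption about how a ubyte-based variant could be set up.

```d
import std.algorithm.searching : canFind, countUntil;
import std.string : indexOf, representation;

// Hypothetical helpers, one per variant. Each returns true when `child`
// occurs somewhere inside `parent`.
bool viaIndexOf( string parent, string child )
{
        return indexOf( parent, child ) > -1;
}

bool viaCanFind( string parent, string child )
{
        return canFind( parent, child );
}

bool viaBytesCanFind( string parent, string child )
{
        // representation() views the string as immutable(ubyte)[] without copying
        return canFind( parent.representation, child.representation );
}

bool viaBytesCountUntil( string parent, string child )
{
        // countUntil returns -1 when the needle is not found
        return countUntil( parent.representation, child.representation ) > -1;
}

void main()
{
        // all four variants should agree on a hit and on a miss
        assert( viaIndexOf( "ACGTAC", "CGTA" ) && viaCanFind( "ACGTAC", "CGTA" ) );
        assert( viaBytesCanFind( "ACGTAC", "CGTA" ) && viaBytesCountUntil( "ACGTAC", "CGTA" ) );
        assert( !viaIndexOf( "ACGTAC", "TTG" ) && !viaCanFind( "ACGTAC", "TTG" ) );
        assert( !viaBytesCanFind( "ACGTAC", "TTG" ) && !viaBytesCountUntil( "ACGTAC", "TTG" ) );
}
```

All four do the same logical work per pair of strings, which fits the timings being so close to one another.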

I have additional questions/concerns/confusion surrounding the foreach() syntax I have had to apply above, but performance remains my chief immediate concern.

