Greetings all,
Many thanks for sharing your collective perspective and advice
thus far! It has been very helpful and instructive. I return
bearing live data and a minimally complete, compilable, and
executable program to experiment with and potentially optimize.
The dataset can be pulled from here:
https://filebin.net/qf2km1ea9qgd5skp/seqs.fasta.gz?t=97kgpebg
Running "cksum" on this file:
1477520542 2199192 seqs.fasta.gz
Naturally, you'll need to gunzip this file. The decompressed file
contains strings on every even-numbered line that have already
been reduced to the unique de-duplicated subset, and they have
already been sorted by descending length and alphabetical
identity.
From my initial post, the focus is now entirely on step #4:
finding/removing strings that can be entirely absorbed
(substringed) by their largest possible parent.
And now for the code:
import std.stdio : writefln, File, stdin;
import std.conv : to;
import std.string : indexOf;
void main()
{
string[] seqs;
foreach( line; stdin.byLine() )
{
if( line[ 0 ] == '>' ) continue;
else seqs ~= to!string( line );
}
foreach( i; 0 .. seqs.length )
{
if( seqs[ i ].length == 0 ) continue;
foreach( j; i + 1 .. seqs.length )
{
if( seqs[ j ].length == 0 ||
seqs[ i ].length == seqs[ j ].length
) continue;
if( indexOf( seqs[ i ], seqs[ j ] ) > -1 )
{
seqs[ j ] = "";
writefln( "%s contains %s", i, j
);
}
}
}
}
Compile the source and then run the executable via redirecting
stdin:
./substr < seqs.fasta
See any straightforward optimization paths here?
For curiosity, I experimented with use of string[] and ubyte[][]
and several functions (indexOf, canFind, countUntil) to assess
for the best potential performer. My off-the-cuff results:
string[] with indexOf() :: 26.5-27 minutes
string[] with canFind() :: >28 minutes
ubyte[][] with canFind() :: 27.5 minutes
ubyte[][] with countUntil() :: 27.5 minutes
Resultantly, the code above uses string[] with indexOf(). Tests
were performed with an Intel(R) Xeon(R) Platinum 8259CL CPU @
2.50GHz.
I have additional questions/concerns/confusion surrounding the
foreach() syntax I have had to apply above, but performance
remains my chief immediate concern.