Short version:

I'm starting to suspect that the ncbi patch does not alter a code path related to -l gi filtering.


Long version:

I've been unable to pursue this full time, but I've gotten some encouragement from Aaron Darling off-list, so I'll add a bit more info to see if it rings any bells.


The BLAST process on the master seems to be working way too hard. I've temporarily removed refseq_genomic and am working only with the 'nt' database. As a result, I no longer crash mpiblast as often. Despite that concession, my master still works for hours while my workers sit idle 99% of the time. I've tracked this down to the master's call to
   runBLAST( COLLECT_STATS_MODE, 0, query_count - 1 );
in main().

I don't know the specifics, but I'd expect COLLECT_STATS_MODE to be quick and shallow. Instead, this call accounts for about 99% of the run time.


Looking in blast_hooks.c, COLLECT_STATS_MODE seems only to trigger:

options->calculate_statistics_and_exit = TRUE;

This in turn is a prominent feature of the mpiBLAST NCBI patch, but never occurs in the unpatched NCBI code. So it seems one of the significant changes mpiBLAST makes to NCBI BLAST is to introduce this additional stats-gathering mode.

I know from past discussions on this list that I may be one of the few (only?) people here using blast's '-l' switch to filter out certain GIs from the database. I also know that this sort of filtering would affect the apparent database size, which would in turn affect the statistics.

So I'm starting to suspect that the ncbi patch does not alter a code path related to -l gi filtering. Ironically I don't really care all that much about exact statistics.
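
To make the database-size point concrete, here is a minimal sketch in C. It assumes the standard Karlin-Altschul model, E = K*m*n*exp(-lambda*S), in which the E-value is linear in the database length n, and it ignores effective-length (edge-effect) corrections. The function name rescale_evalue and its parameters are hypothetical and are not taken from mpiBLAST or the NCBI toolbox.

```c
#include <assert.h>

/* Hypothetical helper, not part of mpiBLAST or the NCBI toolbox.
 *
 * Rescale an E-value that was computed against one database length so
 * that it reflects another. Under E = K*m*n*exp(-lambda*S) the E-value
 * is linear in the database length n, so only the ratio of the two
 * lengths matters. Effective-length corrections are ignored. */
double rescale_evalue(double evalue,
                      double assumed_db_len,
                      double actual_db_len)
{
    assert(evalue >= 0.0);
    assert(assumed_db_len > 0.0 && actual_db_len > 0.0);
    return evalue * (actual_db_len / assumed_db_len);
}
```

With illustrative numbers: a hit reported at E = 1e-5 against 2e10 residues would come out near 1e-6 if the GI list left only 2e9 residues searchable. So if the stats pass and the search pass disagree about which length applies, the reported E-values shift by that ratio.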


Knowing more is going to require digging into the NCBI toolbox, and 2am seems like a bad time to start doing that.



Michael Cariaso wrote:
At blast_hooks.c:1763 I see the following:


/* mpiBLAST: some NCBI functions do not use the
 * SeqMgr's bioseq fetch function (e.g. MuskSeqIdWrite)
 * In order to allow such functions to look up our bioseqs
 * we need to preload them into the SeqMgr
 * This should probably get reported as a bug to NCBI...
 */
 prune = BlastPruneHitsFromSeqAlign(curr_seqalign,
                       number_of_alignments, NULL);

 curr_seqannot->data = prune->sap;
 indexSubjectBioseqs(curr_seqannot->data);



I've determined that this loop is running on the master, while my worker nodes are doing the heavy lifting. For a particular query (I've found many such), the master crashes due to a failed memory allocation several layers below indexSubjectBioseqs() in the NCBI codebase. In fairness, it's trying to allocate 230M, so I can't blame it. It does this after printing the header and the table of scores, but before the alignments.

On previous passes through the loop I've noticed lots of swap activity here. I would guess that the preloading for SeqMgr is the source, but maybe someone here can help me understand this better.

Since some queries trigger this and others don't, I'm guessing that the issue may be related to the specific hits. My blastable DB has all of refseq_genomic and all of nt. Some of the seqs in refseq_genomic are large single segments (250M or more).


Could these be the origin of the problem?
How might this be fixed / avoided?
Has this been reported to NCBI?
I see there have been several revisions of the toolbox since mpiblast was last synchronized with it. Does this suggest any possibilities?



-------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc. Do you grep through log files
for problems?  Stop!  Download the new AJAX search engine that makes
searching your log files as easy as surfing the  web.  DOWNLOAD SPLUNK!
http://ads.osdn.com/?ad_id=7637&alloc_id=16865&op=click
_______________________________________________
Mpiblast-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/mpiblast-users



