G'day Daniel,

At the moment we don't have a peer-reviewed journal publication 
describing how mpiBLAST deals with computing the e-value statistics.  
The most precise description of how it works is of course the mpiBLAST 
code itself and the associated patch to the NCBI Toolbox, although I'll 
save you the time and trouble of having to muck around in the code by 
summarizing the important bits :)

BLAST e-value statistics represent the expected number of times one 
would see a hit with a particular bit-score by chance in a random 
database of the same size as the target database.  Several assumptions 
about models of evolution and other factors go into the score 
calculation.  Many of the model's assumptions about evolution are 
frighteningly simplistic, but the statistics seem to work reasonably well 
in practice.  If you're interested in that aspect I'll refer you to the 
many papers written by Karlin and Altschul.  The scoring model details 
are mostly irrelevant to mpiBLAST because it uses the NCBI BLAST code to 
do all the hit scoring and e-value computation.

The 1.4.0 release of mpiBLAST has the rank 0 MPI process compute the 
effective query and database lengths, which are used for e-value 
calculation, before beginning the parallel search.  The "effective" 
length of a sequence represents the total amount of sequence remaining 
after BLAST has performed low-complexity sequence filtering using the 
DUST algorithm or something similar.  It is assumed that the rank 0 
process has access to 
the complete database so that it can calculate the correct effective 
lengths for the entire DB.  The rank 0 mpiBLAST process calls functions 
in the NCBI Toolbox to filter the sequences and calculate the effective 
search space without actually performing the search.  Once rank 0 has 
calculated the effective query and database lengths, the values are 
broadcast over MPI (MPI_Bcast()) to the rest of the processes.  The 
worker processes then use those values in place of values computed on 
their individual database fragments.
This older e-mail may also be relevant:
http://bioinformatics.org/pipermail/bioclusters/2005-January/002173.html
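
To make the effect of the broadcast concrete, here is a rough Python 
sketch (not mpiBLAST code).  It uses the standard relation between a 
bit score S' and its e-value, E = m' * n' * 2^(-S'), where m' and n' are 
the effective query and database lengths.  All names and numbers are 
made up for illustration, and I'm assuming for simplicity that the 
whole-database effective length is just the sum of the fragment values, 
which is only an approximation of what the NCBI Toolbox computes.

```python
# Illustrative sketch only: names and numbers are hypothetical,
# not taken from the mpiBLAST or NCBI Toolbox sources.

def evalue(bit_score, eff_query_len, eff_db_len):
    """E-value for a bit score: E = m' * n' * 2**(-S')."""
    return eff_query_len * eff_db_len * 2.0 ** (-bit_score)

# Hypothetical effective lengths, in residues.
EFF_QUERY_LEN = 300
FRAGMENT_EFF_LENS = [2_000_000, 3_000_000, 5_000_000]  # one per DB fragment

# Simplifying assumption: the global value rank 0 would compute is
# approximated here by the sum of the per-fragment values.
GLOBAL_EFF_DB_LEN = sum(FRAGMENT_EFF_LENS)

BIT_SCORE = 40.0

# E-values a worker would report if it only knew its own fragment.
per_fragment = [evalue(BIT_SCORE, EFF_QUERY_LEN, n) for n in FRAGMENT_EFF_LENS]

# E-value every worker reports once it uses the broadcast global length.
with_global = evalue(BIT_SCORE, EFF_QUERY_LEN, GLOBAL_EFF_DB_LEN)

if __name__ == "__main__":
    for n, e in zip(FRAGMENT_EFF_LENS, per_fragment):
        print(f"fragment n'={n:>10,d}  E={e:.3e}")
    print(f"global   n'={GLOBAL_EFF_DB_LEN:>10,d}  E={with_global:.3e}")
```

The same hit gets a different (smaller) e-value on each fragment than it 
would against the whole database, which is why the workers need the 
globally computed lengths rather than their fragment-local ones.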

The effective search space calculations are a serial component of the 
mpiBLAST 1.4.0 implementation.  For some workloads, the search space 
calculation can be rather time-consuming, making it an excellent target 
for parallelization in future mpiBLAST versions...
See also this discussion about using an e-value approximation to get 
around the time-consuming serial part of the 1.4.0 implementation:
http://www.mail-archive.com/[email protected]/msg00175.html
http://www.mail-archive.com/[email protected]/msg00177.html

Hope that helps,
-Aaron



Daniel Xavier de Sousa wrote:
>
> Hi all,
>
>  
>
> Please, I’m studying BLAST statistics and fragmented databases 
> for BLAST.
>
> I have read some papers on mpiBLAST, but I didn’t find anything 
> that explains "HOW does mpiBLAST get the exact e-value statistics of 
> NCBI BLAST?".
>
>  
>
> Please, where can I read/study about this?
>
>  
>
> Thanks
>
> Daniel
>
> *****************************************************************
> * Daniel Xavier de Sousa *
> * Mestrando em Informática - PUC-Rio *
> * E-MAIL : dsousaARROBAinf.puc-rio.br *
> * Fone : +55 21 35271500 - 4543 *
> ****************************************************************


_______________________________________________
Mpiblast-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/mpiblast-users
