G'day Daniel,

At the moment we don't have a peer-reviewed journal publication describing how mpiBLAST deals with computing the e-value statistics. The most precise description of how it works is of course the mpiBLAST code itself and the associated patch to the NCBI Toolbox, but I'll save you the time and trouble of having to muck around in the code by summarizing the important bits :)

BLAST e-value statistics represent the expected number of times one would see a hit with at least a particular bit-score by chance in a random database of the same size as the target database. Several assumptions about models of evolution and other factors go into the score calculation. Many of the model's assumptions about evolution are frighteningly simplistic, but the statistics seem to work reasonably well in practice. If you're interested in that aspect I'll refer you to the many papers written by Karlin and Altschul. The scoring model details are mostly irrelevant to mpiBLAST because it uses the NCBI BLAST code to do all the hit scoring and e-value computation.

In the 1.4.0 release of mpiBLAST, the rank 0 MPI process computes the effective query and database lengths used for e-value calculation before the parallel search begins. The "effective" length of a sequence represents the total amount of sequence remaining after BLAST has performed low-complexity sequence filtering using the dust algorithm or something similar. It is assumed that the rank 0 process has access to the complete database so that it can calculate the correct effective lengths for the entire DB. The rank 0 mpiBLAST process calls functions in the NCBI Toolbox to filter the sequences and calculate the effective search space without actually performing the search.

Once the effective query and database lengths have been calculated by rank 0, the values are broadcast over MPI to the rest of the processes. The worker processes then use those values in place of values computed on their individual database fragments.

This older e-mail may also be relevant:
http://bioinformatics.org/pipermail/bioclusters/2005-January/002173.html

The effective search space calculations are a serial component of the mpiBLAST 1.4.0 implementation. For some workloads, the search space calculation can be rather time-consuming, making it an excellent target for parallelization in future mpiBLAST versions...
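
To make it concrete why the workers need the whole-database lengths, here is a minimal sketch in Python. It is not mpiBLAST or NCBI Toolbox code. It uses the standard bit-score form of the e-value, E = m' * n' * 2^(-S'), and every number and name in it (evalue, compare, the fragment lengths, the query length) is made up for illustration. The "global" database length is simply the sum of the fragment lengths here, which is only a stand-in for the value rank 0 would really compute.

```python
def evalue(bit_score, eff_query_len, eff_db_len):
    """Expected number of chance hits scoring at least bit_score.

    Uses the standard bit-score form E = m' * n' * 2**(-S').
    """
    return eff_query_len * eff_db_len * 2.0 ** (-bit_score)


# Hypothetical values, for illustration only.
EFF_QUERY_LEN = 250
FRAGMENT_DB_LENS = [1_000_000, 3_000_000, 4_000_000]  # one entry per worker
BIT_SCORE = 40.0

# Stand-in for the value rank 0 would compute over the complete database and
# then broadcast. Here it is just the sum of the fragment lengths; the real
# NCBI Toolbox calculation is more involved.
GLOBAL_DB_LEN = sum(FRAGMENT_DB_LENS)


def compare(bit_score=BIT_SCORE):
    """Return (fragment-local e-values, e-value from the broadcast length)."""
    local = [evalue(bit_score, EFF_QUERY_LEN, n) for n in FRAGMENT_DB_LENS]
    shared = evalue(bit_score, EFF_QUERY_LEN, GLOBAL_DB_LEN)
    return local, shared


if __name__ == "__main__":
    local, shared = compare()
    for worker, e in enumerate(local, start=1):
        print(f"worker {worker}: fragment-local E = {e:.3e}, "
              f"with broadcast length E = {shared:.3e}")
```

The same hit with the same bit-score gets a smaller e-value on every fragment than it would against the full database, and the size of the error depends on which fragment the hit landed in. Using the one broadcast value on all workers removes that dependence.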

See also this discussion about using an e-value approximation to get around the time-consuming serial part of the 1.4.0 implementation:
http://www.mail-archive.com/[email protected]/msg00175.html
http://www.mail-archive.com/[email protected]/msg00177.html

Hope that helps,
-Aaron

Daniel Xavier de Sousa wrote:
> Hi for all,
>
> Please, I’m studding about statistics BLAST and fragmented database
> for blast.
>
> I have read some papers of MPIBlast, but I didn’t find out anything
> that explains "HOW the MPIBlast gets exact e-value statistics of NCBI
> BLAST?".
>
> Please, where can I read/study about this?
>
> Thanks
>
> Daniel
>
> Daniel Xavier de Sousa
> Mestrando em Informática - PUC-Rio
> E-MAIL : dsousaARROBAinf.puc-rio.br
> Fone : +55 21 35271500 - 4543
