Hi Intikhab,

The 14-minute benchmark time you got for searching the 300 KB Erwinia 
query set against 14 GB of nt is in line with my own experiments, and the 
output size sounds about right.  At this point, it's clear that some 
particular aspect of your fungal amino acid dataset is triggering a 
bottleneck in mpiblast.  Just to clarify: are you using vanilla mpiblast 
1.4.0, or mpiBLAST/pio, which does parallel I/O?  If the output file for 
your amino acid data is particularly large and you are using vanilla 
mpiblast, the bottleneck may be the mpiblast writer process.  In that 
case, you may see a large speedup by switching to mpiBLAST/pio and an 
appropriate parallel filesystem.  If that's not an option for some 
reason, you might consider further limiting the number of results 
generated for any given query sequence with -b 10 -v 10, where 10 is the 
number of hits to report results for.
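
For reference, a run capped that way might look something like the 
sketch below (the query file name and output path here are illustrative, 
with the database name borrowed from your posted command line; -b and -v 
are standard blastall options that mpiblast passes through, limiting the 
number of alignments and one-line descriptions respectively):

  mpirun -np 40 mpiblast -p blastp -d 36FungalJGIanigNbcin_M40 \
      -i queries.fasta -m 8 -e 1e-5 \
      -b 10 -v 10 \
      -o results_b10v10.out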

While it is possible to force mpiblast to do strictly query 
segmentation, it's not clear whether that will help in your situation.  
To do it, you would format the database into a single fragment and then 
set --db-replicate-count=N, where N is the number of worker nodes (38 in 
your case); see the sketch below.  For vanilla (not pio) mpiblast 1.4.0, 
the output would still get funneled through a single writer process, so 
if your search is extremely output-bound, it may still suffer.  With 
pure query segmentation the writer process might run more efficiently, 
but I have never benchmarked it, so I can't say for sure.  Note that the 
current mpiblast releases aren't smart enough to skip the search space 
calculation when pure query segmentation is used, but that should be a 
fairly trivial modification to mpiblast.cpp...
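
In concrete terms, a pure query segmentation run might look something 
like this (query and output names are illustrative; the replicate count 
of 38 follows from your 40-process runs, as noted above):

  mpiformatdb -i 36FungalJGIanigNbcin_M40 --nfrags=1 -p T
  mpirun -np 40 mpiblast -p blastp -d 36FungalJGIanigNbcin_M40 \
      -i queries.fasta -m 8 -e 1e-5 \
      --db-replicate-count=38 \
      -o results_queryseg.out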

-Aaron


Intikhab Alam wrote:
> Sorry for the cross-posting; my earlier email to the mpiblast mailing
> list got returned, perhaps due to the attachments.
>
>
> Hi Aaron,
> Hi All,
>
> I have been struggling to run mpiblast on the dataset of proteins I
> have from 36 fungal genomes. I want to compare all proteins to each
> other, which is usually called an 'all-against-all' blast. Simple blastp
> on this dataset, chopped into 36 fragments, takes around 24 hours on
> our cluster, which runs Score (version 5.8.4.r3) as its MPI
> environment and has 25 nodes, each with 4 cores and 8 GB of RAM.
>
> On the same cluster, when I run mpiblast on 38 fragments using 40
> nodes, even after 12 days I only get 22% of the total estimated output
> (I know the size of the complete output from the earlier standard
> blast run).
>
> Aaron suggested testing the following, but it seems this still did not
> solve the problem. I tested these suggestions from Aaron:
>
> 1) The --db-replicate-count=5 option, so that mpiblast will distribute
> five copies of the database across the worker nodes, which may improve
> performance.
>
> With this option, I checked the results after 4 days and got merely
> 6% of the output.
>
> 2) mpiblast is optimized for MPICH or LAM/MPI, so it may be useful to
> test mpiblast with these MPI implementations.
>
> We have GNU MPICH installed and ran mpiblast on our dataset
> with --debug as below:
>
>
>   mpiformatdb -i 36FungalJGIanigNbcin_M40 --nfrags=38 -p T --skip-reorder
>
>   /usr/local/sge6.0/bin/lx24-amd64/qsub -pe mpich 10 \
>     /usr/local/mpich-GNU64/bin/mpirun -arch LINUX -machinefile \
>     /usr/local/mpich-GNU64/share/machines.LINUX -np 40 -nodes 10 \
>     -nolocal -allcpus -v \
>     /software/man2/manchester/mpiblast/tool/bin/mpiblast --debug \
>     -p blastp -d 36FungalJGIanigNbcin_M40 \
>     -i /scratch/man2/nwmcixa/mpiblast/work/36FungalJGIanigNbcin_M40 \
>     -m 8 -e 1e-5 \
>     -o /scratch/man2/nwmcixa/mpiblast/work/36FungalJGIanigNbcin_M40_mpi.out
>
>
> It did not produce any output. The STDOUT (captured because the
> --debug option was used) is shown at:
> http://www.bioinf.manchester.ac.uk/~alam/xxxxxx/mpiblast/mpirun_36genomes.e29468
>
>
>
> 3) Benchmark dataset (300K Erwinia query set against the first 14GB
> of nt using blastn).
>
> mpiblast was run on the benchmark dataset and it produced 1.07 MB of
> output within 14 minutes (shown at
> http://www.bioinf.manchester.ac.uk/~alam/xxxxxx/mpiblast/echry_In_nt14G_m8.out ).
> The STDOUT is shown at:
> http://www.bioinf.manchester.ac.uk/~alam/xxxxxx/mpiblast/mpirun_benchmark.e29467
>
>
> I am not sure whether this run produced the complete output compared
> to Aaron Darling's standard mpiblast test on the benchmark dataset.
>
>
> I am not sure what is causing the slow mpiblast performance on my
> dataset.
>
> Is it possible to produce a version of mpiblast that splits the query
> dataset instead of the search (target) dataset when the query and
> target datasets are the same size? That way there would be no need to
> correct the e-value calculation, and the output of each query fragment
> against the complete dataset could simply be concatenated to get the
> complete results. Running standard blast in this way produces the
> output in around 24 hours.
>
> I hope to get some help resolving this mpiblast problem on my dataset.
>
> Many Thanks,
>
> Intikhab
>
>
>
> --
> Dr. Intikhab Alam
> Research Associate
> School of Computer Science
> University of Manchester
> LF1, Kilburn Building,
> Oxford Road Manchester, M13 9PL
> United Kingdom
> http://www.cs.man.ac.uk/~ialam
>

