The query test set is on the mpiblast download archive: http://www.mpiblast.org/Downloads.Archive.html
Specifically, you're after the 300 KB of E. chrysanthemi predicted ORFs:
http://www.mpiblast.org/downloads/files/e.chrysanthemi.fas

As for the nt database, you'll have to download it from NCBI:
ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz
and siphon off the first 14GB (uncompressed) with dd or something similar.
It may not be identical to what I used in 2005, but it should be close
enough for a cursory runtime check.

For extra points, try using mpiformatdb's ability to read uncompressed
FASTA databases from stdin. That should allow you to build a series of
unix pipes that save plenty of disk I/O (a few rough sketches are appended
below the quoted thread).

-Aaron

ialam wrote:
> Hi Aaron,
>
> I wanted to have the benchmark dataset so that I could test mpiblast
> performance. Could you please point me to the dataset? In the meantime
> I am trying to get mpich running on the cluster.
>
> Many Thanks,
>
> Intikhab
>
> ----- Original Message -----
> From: "intikhab alam" <[EMAIL PROTECTED]>
> To: "Aaron Darling" <[EMAIL PROTECTED]>
> Sent: Friday, March 02, 2007 1:20 PM
> Subject: Re: [Mpiblast-users] blast in 1 day but could not get
> mpiblast done even in 10 days for the same dataset
>
>> Hi Aaron,
>>
>> I would like to try out the benchmark dataset; could you point me to
>> where I could download this?
>>
>> Intikhab
>>
>> ----- Original Message -----
>> From: "Aaron Darling" <[EMAIL PROTECTED]>
>> To: "intikhab alam" <[EMAIL PROTECTED]>
>> Sent: Friday, March 02, 2007 6:21 AM
>> Subject: Re: [Mpiblast-users] blast in 1 day but could not get
>> mpiblast done even in 10 days for the same dataset
>>
>> : It sounds like there must be something causing an mpiblast-specific
>> : communications bottleneck in your system. Anybody else have ideas
>> : here? If you're keen to verify that, you could run mpiblast on the
>> : benchmark dataset we were using on Green Destiny and compare
>> : runtimes. My latest benchmark data set (dated June 2005) has a
>> : runtime of about 16 minutes for 64 nodes to search the 300K erwinia
>> : query set against the first 14GB of nt using blastn. Each compute
>> : node in that machine was a 667MHz Transmeta chip, 640MB RAM,
>> : connected via 100Mbit ethernet. I was using mpich2-1.0.1, no SCore.
>> : Based on paper specs, your cluster should be quicker than that.
>> :
>> : On the other hand, if you've got wild amounts of load imbalance,
>> : --db-replicate-count=5 may not be enough, and 41 may prove ideal
>> : (where 41 = the number of nodes in your cluster). In that case,
>> : mpiblast will have effectively copied the entire database to each
>> : node, totally factoring out load imbalance from the compute time
>> : equation. Your database is much smaller than each node's core
>> : memory, and a single fragment is probably much larger than each
>> : node's CPU cache, so I can't think of a good reason not to fully
>> : distribute the database, apart from the time it takes to copy DB
>> : fragments around.
>> :
>> : In any case, keep me posted if you discover anything.
>> :
>> : -Aaron
>> :
>> : intikhab alam wrote:
>> : > Hi Aaron,
>> : >
>> : > As per your suggestion, I used the following option:
>> : >
>> : > --db-replicate-count=5
>> : >
>> : > assuming it may help reach the 24hrs mark to complete the job.
>> : > However, I see that only 6% of the (total estimated) output has
>> : > been generated until now (i.e. after 4 days, 4*24 hrs). If I
>> : > continue this way, my mpiblast would finish in 64 days. Any other
>> : > suggestion to improve the running time?
>> : >
>> : > Intikhab
>> : >
>> : > ----- Original Message -----
>> : > From: "Aaron Darling" <[EMAIL PROTECTED]>
>> : > To: "intikhab alam" <[EMAIL PROTECTED]>;
>> : > <[email protected]>
>> : > Sent: Wednesday, February 21, 2007 1:33 AM
>> : > Subject: Re: [Mpiblast-users] blast in 1 day but could not get
>> : > mpiblast done even in 10 days for the same dataset
>> : >
>> : > : Hi Intikhab...
>> : > :
>> : > : intikhab alam wrote:
>> : > : > : can take a long time to compute the effective search space
>> : > : > : required for exact e-value calculation. If that's the
>> : > : > : problem, then you would find just one mpiblast process
>> : > : > : consuming 100% cpu on the rank 0 node for hours or days,
>> : > : > : without any output.
>> : > : >
>> : > : > Is the effective search space calculation done on the master
>> : > : > node? If yes, this mpiblast job stayed at the master node for
>> : > : > some hours and then all the compute nodes got busy with >90%
>> : > : > usage all the time, with continued output being generated
>> : > : > until the 12th day when I killed the job.
>> : > : >
>> : > : yes, the search space calculation is done on the master node,
>> : > : and it sounds like using the --fast-evalue-approximation
>> : > : command-line switch would save you a few hours, which is pretty
>> : > : small compared to the weeks or months that the rest of the
>> : > : search is taking.
>> : > :
>> : > : > :
>> : > : > : The more likely limiting factor is load imbalance on the
>> : > : > : cluster.
>> : > : >
>> : > : > In this case, do you think the job should finish on some nodes
>> : > : > earlier than others? In my case the job was running on all the
>> : > : > nodes with >90% usage and the last output I got was on the
>> : > : > last day, when I killed the job.
>> : > : >
>> : > : It's possible the other nodes may continue running mpiblast
>> : > : workers which are waiting to send results back to the mpiblast
>> : > : writer process.
>> : > :
>> : > : > : If some database fragments happen to have a large number of
>> : > : > : hits and others have few, and the database is distributed as
>> : > : > : one fragment per node, then the computation may be heavily
>> : > : > : imbalanced and may run quite slowly. CPU consumption as
>> : > : > : given by a CPU monitoring tool may not be indicative of
>> : > : > : useful work being done on the nodes since workers can do a
>> : > : > : timed spin-wait for new work.
>> : > : > : I can suggest two avenues to achieve better load balance
>> : > : > : with mpiblast 1.4.0. First, partition the database into more
>> : > : > : fragments, possibly two or three times as many as you
>> : > : > : currently have. Second, use the
>> : > : >
>> : > : > You mean more fragments, which in turn means using more nodes?
>> : > : > Actually at our cluster not more than 44 nodes are allowed for
>> : > : > the parallel jobs.
>> : > : >
>> : > : no, it's not necessary to run on more nodes when creating more
>> : > : fragments. mpiblast 1.4.0 needs at least as many fragments as
>> : > : nodes when --db-replicate-count=1 (the default value). when
>> : > : there are more fragments than nodes, mpiblast will happily
>> : > : distribute the extra fragments among the nodes.
>> : > :
>> : > : > : --db-replicate-count option to mpiblast.
>> : > : > : The default value for the db-replicate-count is 1, which
>> : > : > : indicates that mpiblast will distribute a single copy of
>> : > : > : your database across worker nodes. For your setup, each node
>> : > : > : was probably getting a single fragment. By setting
>> : > : >
>> : > : > Is it not right if each single node gets a single fragment of
>> : > : > the target database (the number of nodes assigned for mpiblast
>> : > : > = number of fragments + 2), so that the whole query dataset
>> : > : > could be searched against the fragment on each single node
>> : > : > (with the effective search space calculation being done before
>> : > : > starting the search, for blast-comparable e-values)?
>> : > : >
>> : > : the search space calculation happens on the rank 0 process and
>> : > : is totally unrelated to the number of nodes and number of DB
>> : > : fragments. The most basic mpiblast setup has one fragment per
>> : > : node, but when load-balancing is desirable, as in your case,
>> : > : mpiblast can be configured to use multiple fragments per node.
>> : > : This will not affect the e-value calculation.
>> : > :
>> : > : > : --db-replicate-count to something like 5, each fragment
>> : > : > : would be copied to five different compute nodes, and thus
>> : > : > : five nodes would be available to search fragments that
>> : > : > : happen to have lots of hits. In the extreme
>> : > : >
>> : > : > You mean this way nodes would be busy searching the query
>> : > : > dataset against the same fragment on 5 compute nodes? Is this
>> : > : > just a way to keep the nodes busy until all the nodes complete
>> : > : > the searches?
>> : > : >
>> : > : Yes, this will balance the load and will probably speed up your
>> : > : search.
>> : > :
>> : > : > : case you could set --db-replicate-count equal to the number
>> : > : > : of fragments, which would be fine if per-node memory and
>> : > : > : disk space is substantially larger than the total size of
>> : > : > : the formatted database.
>> : > : > :
>> : > : >
>> : > : > Is it possible in mpiblast that, for cases where the size of
>> : > : > the query dataset is equal to the size of the target dataset,
>> : > : > the query dataset is fragmented, the target dataset is kept in
>> : > : > the global/shared area, and searches are done on single nodes
>> : > : > (the number of nodes equal to the number of query dataset
>> : > : > fragments)? This way there would be no need to calculate the
>> : > : > effective search space, as all the search jobs get the same
>> : > : > size of target dataset. By following this approach I managed
>> : > : > to complete this job using standard blast in < 24hrs.
>> : > : >
>> : > : The parallelization approach you describe is perfectly
>> : > : reasonable when the total database size is less than the core
>> : > : memory size on each node. With a properly configured
>> : > : --db-replicate-count, I would guess that mpiblast could approach
>> : > : the 24 hour mark, although it may take slightly longer since
>> : > : there are various overheads involved with copying of fragments
>> : > : and serial computation of the effective search space.
>> : > :
>> : > : > :
>> : > : > : In your particular situation, it may also help to randomize
>> : > : > : the order of sequences in the database to minimize "fragment
>> : > : > : hotspots" which could result from a database self-search.
>> : > : >
>> : > : > I did not get the "fragment hotspots" bit here. By randomizing
>> : > : > the order of sequences, do you mean each node would possibly
>> : > : > take a similar time to finish the searches? Otherwise it could
>> : > : > be possible that the number of hits is lower for some
>> : > : > fragments than others, and this ends up in different job
>> : > : > completion times on different nodes?
>> : > : >
>> : > : Right, the goal is to get the per-fragment search time more
>> : > : balanced through randomization. But after thinking about it a
>> : > : bit more, I'm not sure just how much this would save....
>> : > :
>> : > : > :
>> : > : > : At the moment mpiblast doesn't have code to accomplish such
>> : > : > : a feat, but I think others (Jason Gans?) have written code
>> : > : > : for this in the past.
>> : > : >
>> : > : > Aaron, do you think SCore-based MPI communication may be
>> : > : > delaying the overall time in running mpiblast searches?
>> : > : >
>> : > : It's possible.
>> : > : The interprocess communication in 1.4.0 was fine-tuned for
>> : > : default mpich2 1.0.2 and lam/mpi implementations. We use various
>> : > : combinations of the non-blocking MPI_Issend(), MPI_Irecv(), and
>> : > : the blocking send/recv api in mpiblast 1.4.0. I have no idea how
>> : > : it would interact with SCore.
>> : > :
>> : > : -Aaron
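
A minimal shell sketch of the download/truncate/format steps described at
the top of this message, for anyone reproducing the benchmark. The file
names, the fragment count of 88, and the intermediate uncompressed copy
are only illustrative assumptions; the -i/-p/-N options follow the usual
formatdb-style conventions used by mpiformatdb.

    # Fetch nt in FASTA form from NCBI (URL as given above).
    wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz

    # Uncompress and keep roughly the first 14GB, as suggested above.
    # dd cuts at a byte boundary, so the final record will be truncated;
    # formatdb-style tools generally just warn about such a record.
    gunzip -c nt.gz > nt.fas
    dd if=nt.fas of=nt.14G.fas bs=1M count=14336

    # Format the trimmed database into fragments for mpiblast.
    # 88 fragments is just an example (about two per worker on a 44-node
    # allocation, for the load-balancing reasons discussed in the thread).
    mpiformatdb -i nt.14G.fas -p F -N 88

    # As suggested above, mpiformatdb can also read an uncompressed FASTA
    # stream from stdin, so the gunzip/dd/format steps could be chained
    # into one pipeline and skip the intermediate files; the exact stdin
    # invocation depends on the mpiformatdb version (see its --help).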
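
And a sketch of the kind of replicated-database run discussed in the
quoted thread, assuming a 43-process job (41 workers plus the two extra
mpiblast processes, per the "nodes = fragments + 2" accounting above).
The query and output file names and the mpirun syntax are illustrative
assumptions; --db-replicate-count and --fast-evalue-approximation are the
mpiblast 1.4.0 switches mentioned in the thread, and -p/-d/-i/-o are the
usual blastall-style arguments.

    # 43 MPI processes = 41 workers + the two non-worker mpiblast ranks.
    # --db-replicate-count=5 places each fragment on five workers to even
    # out "hot" fragments; --fast-evalue-approximation avoids the long
    # serial effective-search-space computation on the rank 0 node.
    mpirun -np 43 mpiblast -p blastn -d nt \
        -i e.chrysanthemi.fas -o results.blastn \
        --db-replicate-count=5 --fast-evalue-approximation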
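
Finally, on randomizing the order of database sequences to avoid
"fragment hotspots": mpiblast itself has no option for this (as noted in
the thread), but a generic shell shuffle along the following lines could
be run before mpiformatdb. This is only a sketch; it assumes GNU shuf is
available, that '>' appears only at the start of definition lines, and
that the byte \002 does not occur in the file.

    # Flatten each FASTA record onto one line (newlines -> \002), shuffle
    # the records, then restore the newlines and drop the blank lines
    # left behind by the sentinel substitution.
    awk 'BEGIN { RS = ">"; ORS = "" }
         NR > 1 { gsub(/\n/, "\002"); print ">" $0 "\n" }' nt.14G.fas \
      | shuf \
      | tr '\002' '\n' \
      | grep -v '^$' > nt.14G.shuffled.fas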
