Peter, 

Sorry for the confusion about the differences between pioblast and
mpiblast-pio. The following archived message gives some explanation:
http://sourceforge.net/mailarchive/forum.php?thread_id=9626643&forum_id=43689

Basically, the current release incorporates only parallel output, not
parallel input. My answers are inline below.

> 1)  The README for mpiblast says you need to run mpiformatdb first
> (with args to specify frag size, number frags, etc).  However, your paper
> "Efficient Data Access for Parallel BLAST" states one of the goals of
> mpiblast-PIO was to avoid this and instead do dynamic partitioning.
> So exactly what is required in terms of prepartitioning?

Currently mpiblast-pio uses the same static partitioning as mpiblast 1.4.
The dynamic partitioning technique has not been integrated yet.

> 
> 2)  I assume all the documentation that is relevant specifically to
> mpiblast-PIO is in the first 43 lines of your README, starting at
> "Changes between 1.4.0 and 1.4.0-pio test release".  Is that correct?

Yes, that is correct.

> 
> 3)  Regarding Section 4 of your paper, you indicate that at the time the
> paper was written, pioBlast would not handle very large databases. Has
> that issue been addressed?  If so, what do I do to deal with NT?

The issue is related to the dynamic partitioning. Since the current version
of mpiblast-pio uses static partitioning, you don't have to worry about this.

> 
> 4) mpiblast requires that 2 extra processes be allocated to deal with IO
> and master slave issues.  Does mpiblast-PIO still require extra processes
> or is all that transparent?

Mpiblast-pio uses one extra process (the master) to handle the scheduling. 

> 
> 5)  I will be running on a large cluster with 4 cpus per node, each node
> is diskless as we have a parallel file system.  How does your code know
> how big to make a fragment so that it is cache friendly?
> 
> Or do I specify the size of a frag?  If I specify it, are there any
> guidelines for determining how big to make a frag relative to total
> memory (all 4 cpus share the memory on a node.)

You have to specify either the frag size or the number of frags when
formatting the database with mpiformatdb. Regarding the frag size, the
rule of thumb is to make each database frag small enough that the
sequence data and intermediate results can be cached in memory. One
suggestion is to keep the frag size below 80% of the memory available to
each mpiblast process. In your case, if the node has 2GB of total memory
and you want to run 4 mpiblast instances per node, each instance gets
about 512MB, so the recommended frag size is less than 400MB. Another
consideration is that, in our experience, making the number of DB frags
equal to the number of mpiblast workers (excluding the extra master
process) normally delivers the best performance.
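
As a rough illustration of the arithmetic above, here is a minimal Python
sketch. The function names are hypothetical and are not part of mpiblast or
mpiformatdb; the 80% fraction and the "one frag per worker, one extra master
process" rule are taken from the advice above, and everything else is an
assumption.

```python
def max_frag_size_mb(node_memory_mb: float, procs_per_node: int,
                     fraction: float = 0.8) -> float:
    """Upper bound on the frag size (in MB) under the 80% rule of thumb.

    Assumes the node memory is shared evenly by the mpiblast processes
    running on that node.
    """
    if procs_per_node < 1:
        raise ValueError("procs_per_node must be at least 1")
    per_process_mb = node_memory_mb / procs_per_node
    return fraction * per_process_mb


def suggested_num_frags(total_mpi_processes: int) -> int:
    """One frag per worker; one process is reserved for the master."""
    if total_mpi_processes < 2:
        raise ValueError("need at least one master and one worker")
    return total_mpi_processes - 1


if __name__ == "__main__":
    # 2GB node, 4 mpiblast instances per node: about 410MB per frag at most,
    # which is why "less than 400MB" is a comfortable target.
    print(max_frag_size_mb(2048, 4))
    # Hypothetical job with 65 MPI processes: 64 workers, so 64 frags.
    print(suggested_num_frags(65))
```

The resulting numbers are only a starting point; the frag size or frag count
still has to be passed to mpiformatdb when formatting the database.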

Since you are using diskless nodes, another suggestion (not about the frag
size) is to skip DB distribution by setting identical Shared and Local paths
in the ".ncbirc" file. 
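
To double-check such a setup, a small Python sketch like the following could
confirm that the two paths match. It assumes the paths live under an
"[mpiBLAST]" section with "Shared" and "Local" keys, as I recall from the
mpiblast README; please verify against your own ".ncbirc". The path
/pfs/blastdb is purely hypothetical.

```python
import configparser


def shared_equals_local(ncbirc_text: str) -> bool:
    """Return True if the Shared and Local paths in .ncbirc text match.

    Assumes an [mpiBLAST] section containing Shared= and Local= entries.
    """
    # Disable interpolation so a '%' in a path is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ncbirc_text)
    section = parser["mpiBLAST"]
    # configparser treats option names case-insensitively by default.
    return section["Shared"].strip() == section["Local"].strip()


# Hypothetical example: both entries point at the parallel file system.
EXAMPLE = """\
[mpiBLAST]
Shared=/pfs/blastdb
Local=/pfs/blastdb
"""

if __name__ == "__main__":
    print(shared_equals_local(EXAMPLE))
```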

> 
> 6) Do you know if anyone at LLNL has any experience using mpiblast-PIO?

Unfortunately, I am not aware of other mpiblast-pio users at LLNL.

> 
> 7) Regarding the --enable-mpi-atomicity flag, you say "Use this option
> if missing data is observed in the output file."  That sentence does not
> make sense.  How can you observe missing data?  If you mean to run
> everything using regular blast and also mpiblast-PIO and then compare the
> output, then that is a bit absurd.  We need correct results that we can
> rely on.  Is there a penalty or some disadvantages for using the
> atomicity flag?  If not, then why not make it the default?

If data goes missing, the output file will have the correct size, but
some of its contents will be filled with garbage data. For the text output
formats, this is easily detected by inspecting portions of the output by
eye. For the binary output formats, the garbage content should cause errors
in the applications that subsequently process the results. The bottom line
is that there won't be an undetectable error case in which all the results
in the output file are correct but some other legitimate results are
missing.
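
For text output, the check by eye could be partly automated. The Python
sketch below is only a heuristic under my own assumption that garbage shows
up as runs of bytes that are not printable ASCII (for example NUL bytes);
the actual appearance of the garbage may differ by file system. The function
name is hypothetical, and a clean scan does not prove the results are correct.

```python
def find_suspect_runs(data: bytes, min_run: int = 4) -> list[tuple[int, int]]:
    """Return (offset, length) for each run of at least min_run bytes
    that are neither printable ASCII nor tab, CR, or LF."""
    allowed = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}
    runs = []
    start = None  # offset where the current suspect run began, if any
    for i, byte in enumerate(data):
        if byte not in allowed:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_run:
                runs.append((start, i - start))
            start = None
    # Handle a suspect run that extends to the end of the data.
    if start is not None and len(data) - start >= min_run:
        runs.append((start, len(data) - start))
    return runs


if __name__ == "__main__":
    # Synthetic sample: 8 NUL bytes sit between two lines of text.
    sample = b"Query= seq1\n" + b"\x00" * 8 + b"Score = 42\n"
    print(find_suspect_runs(sample))
```

For a real output file, read it in binary mode (open(path, "rb")) and pass
the bytes to the function; very large files would need to be scanned in
chunks.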

This flag is not the default because write performance might be
substantially degraded on some parallel file systems when the output file
is operated in atomic mode. But you raise a good point. In fact, in my
experiments on SGI XFS, the overall performance of mpiblast-pio doesn't
show significant differences when atomic mode is enabled, probably
because mpiblast-pio is not I/O bound given the high computation cost
of the current BLAST algorithm. I suggest doing some test runs to verify
whether enabling this flag incurs a considerable performance penalty on
your platform. If not, you can simply make it a default option.

In the future I think we can address this issue in two ways: 1) benchmark
mpiblast-pio on different parallel file systems to give users specific,
system-dependent guidelines; 2) include a test program to help users verify
the correctness of the results.

> 
> 8) Is there a current paper published on mpiblast-PIO, (later than the
> one quoted above)?

So far the most relevant paper is still the pioblast one. 

> 
> 9) Am I missing something that answers my questions?
> 
> It looks like the mpiblast-PIO project is very, very valuable and I
> appreciate all the work you and your colleagues have done.  However, I
> believe it deserves some simple documentation on how to use it --- unless
> I'm missing something.
> 

Agreed. We were trying to make the usage of mpiblast-pio as close as
possible to that of mpiblast 1.4, so that existing mpiblast users can adopt
it with minimal learning effort. But it looks like the README is inadequate
in addressing the differences between mpiblast-pio and pioblast, as well as
in explaining its specific features in detail. We will consider writing a
detailed usage document or a paper on mpiblast-pio to help users get
started.

We appreciate all your comments and feedback, as they are very valuable in
helping us improve this project. Please report any problems or questions so
we can address them in our ongoing work.

Thanks,
Heshan



_______________________________________________
Mpiblast-users mailing list
https://lists.sourceforge.net/lists/listinfo/mpiblast-users
