Hi Ole,
after investigating, it looks like the 15 missing sequence IDs were simply
removed by the e-value filter (the corresponding sequences were very
short).
So far my command line for GNU Parallel + BLAST under LSF is:
---------------------------------------------------------------------------------------------------------------
cat queryFile.fasta |
  parallel --no-notice -j LSFSLOTS \
    --tmpdir /network_filesystem_partition/tmp \
    --wait --slf serversFile --block 200k --recstart '>' --pipe \
    blastp -evalue 1e-05 -outfmt 6 -db dbFile -query - -out resultFile_{#}
wait
---------------------------------------------------------------------------------------------------------------
- I specified a shared folder as --tmpdir since it looks like GNU Parallel
uses the local /tmp by default, and I am not sure whether that is the
master node's /tmp (which cannot be read by the other nodes) or not
- I added the final wait to be sure LSF releases the worker nodes only
after every blastp has finished (do you think I should also/instead add
--fg?)
Anyway, my main concern now is: how do I specify the maximum number of
jobs per host when I am using multiple worker nodes?
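
In case it helps, this is what I was thinking of trying, based on the
ncpus/sshlogin form described in the --slf section of the man page (just a
sketch that builds the server file from the LSF hostfile):

# Turn the LSF hostfile (one line per slot) into an --slf file with an
# explicit slot count per host, e.g. "2/server_1".
sort ${LSB_DJOB_HOSTFILE} | uniq -c | awk '{print $1 "/" $2}' > serversFile
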
best,
giuseppe
On Tue, Apr 21, 2015 at 8:00 PM, Giuseppe Aprea <[email protected]>
wrote:
> Hi Ole,
>
> sorry for the late reply, but our cluster had to undergo maintenance.
> I have a few remarks and questions, please.
>
> *Remote nodes.* LSF just reserves slots on several remote servers and
> launches your command line on one of them, which we can call the master
> node. The LSF-reserved nodes are written to a file whose path is in the
> LSF environment variable LSB_DJOB_HOSTFILE. As an example, if LSF gives
> you 2 slots on server_1 and 3 slots on server_2, this file contains:
> server_1
> server_1
> server_2
> server_2
> server_2
> LSF slots should correspond to server cores. That doesn't mean LSF is
> able to enforce the number of program instances; that must be done by the
> users, who may be given slots on the same server. Following LSF syntax,
> which is also similar to MPI hostfile syntax, I repeated the server
> names, but you are saying that's useless. My question is: *(Q1)* How do I
> specify the maximum number of jobs per host? Is it something like
> (following the previous example)
> 2/server_1
> 3/server_2
>
> *Empty result files.* I guess I retrieved empty result files for
> different reasons; one was, as you noticed, the wrong replacement string
> ( {%} instead of {#} ), but I also had the wrong temporary directory
> (which must be on a shared filesystem in my case). Now I think I have
> reached a good point with the following script:
>
> #!/bin/bash
>
> #BSUB -J gnuParallel_blast_test   # Name of the job.
> #BSUB -o %J.out                   # Appends std output to file %J.out (%J is the Job ID).
> #BSUB -e %J.err                   # Appends std error to file %J.err.
> #BSUB -q cresco3_h144             # Queue name.
> #BSUB -n 70                       # Number of CPUs.
>
> module load 4.8.3/ncbi/12.0.0
> module load 4.8.3/parallel/20150122
>
> # One slot per line in the LSF hostfile.
> SLOTS=`wc -l < ${LSB_DJOB_HOSTFILE}`
>
> # Build the --slf server file (truncating first, so reruns do not append).
> sort ${LSB_DJOB_HOSTFILE} > servers
>
> cat /gporq1_1M/usr/aprea/bio/solanum_melongena/analysis/orthomcl_00/goodProteins.fasta |
>   parallel --no-notice -vv -j ${SLOTS} \
>     --tmpdir /gporq1_1M/usr/aprea/bio/solanum_melongena/analysis/orthomcl_00/tmp \
>     --wait --slf servers --block 200k --recstart '>' --pipe \
>     blastp -evalue 1e-05 -outfmt 6 \
>       -db /gporq1_1M/usr/aprea/bio/solanum_melongena/analysis/orthomcl_00/goodProteins \
>       -query - \
>       -out /gporq1_1M/usr/aprea/bio/solanum_melongena/analysis/orthomcl_00/resultd_{#}
> wait
>
>
> The server file generated at runtime was:
>
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x004.portici.enea.it
> cresco3x011.portici.enea.it
> cresco3x011.portici.enea.it
> cresco3x011.portici.enea.it
> cresco3x011.portici.enea.it
> cresco3x011.portici.enea.it
> cresco3x011.portici.enea.it
> cresco3x011.portici.enea.it
> cresco3x011.portici.enea.it
> cresco3x011.portici.enea.it
> cresco3x011.portici.enea.it
> cresco3x011.portici.enea.it
> cresco3x011.portici.enea.it
> cresco3x011.portici.enea.it
> cresco3x011.portici.enea.it
> cresco3x011.portici.enea.it
> cresco3x011.portici.enea.it
> cresco3x011.portici.enea.it
> cresco3x011.portici.enea.it
> cresco3x011.portici.enea.it
> cresco3x011.portici.enea.it
> cresco3x011.portici.enea.it
> cresco3x011.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
> cresco3x013.portici.enea.it
>
> (I had not read your message about repeated hostnames when I launched.)
> This time the stderr did not look too bad, just a few warnings:
> parallel: Warning: ssh to cresco3x004.portici.enea.it only allows for 0
> simultaneous logins.
> You may raise this by changing /etc/ssh/sshd_config:MaxStartups and
> MaxSessions on cresco3x004.portici.enea.it.
> Using only -1 connections to avoid race conditions.
> parallel: Warning: ssh to cresco3x013.portici.enea.it only allows for 0
> simultaneous logins.
> You may raise this by changing /etc/ssh/sshd_config:MaxStartups and
> MaxSessions on cresco3x013.portici.enea.it.
> Using only -1 connections to avoid race conditions.
> parallel: Warning: ssh to cresco3x011.portici.enea.it only allows for 0
> simultaneous logins.
> You may raise this by changing /etc/ssh/sshd_config:MaxStartups and
> MaxSessions on cresco3x011.portici.enea.it.
> Using only -1 connections to avoid race conditions.
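>
> For reference, this is how I checked the configured limits on one of the
> nodes (a sketch; if these keys are not set explicitly in sshd_config,
> OpenSSH's compiled-in defaults apply):
>
> grep -iE 'MaxStartups|MaxSessions' /etc/ssh/sshd_config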
> *(Q2)* Do you have any comments on that?
> I retrieved 348 result files (all of them non-empty) and concatenated
> them into a single file. The problem now is that for this test I ran an
> all-vs-all BLAST, so I expect at least 1 hit for each sequence in the
> input (each sequence against itself). Unfortunately, that is not the case:
>
> awk '{print $1}' resultd_all | sort | uniq | wc -l
> 175610
> egrep "^>" goodProteins.fasta |wc -l
> 175625
>
> As you can see, I have 15 sequence IDs missing. I am still investigating,
> but I would like to ask you *(Q3)* whether those IDs could have been lost
> when the data chunks were created (I used "--block 200k --recstart '>'
> --pipe") and, if so, how I could avoid that.
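>
> In the meantime, this is how I am hunting for the missing IDs (a sketch
> using comm; it assumes the query ID is the entire header line after '>',
> which matches column 1 of the -outfmt 6 output here):
>
> # IDs that got at least one hit, versus all query IDs in the input.
> awk '{print $1}' resultd_all | sort -u > ids_with_hits
> sed -n 's/^>//p' goodProteins.fasta | sort -u > ids_all
> comm -23 ids_all ids_with_hits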
>
> This is the input file structure:
>
> head -n 12 goodProteins.fasta
> >tom|Solyc00g005000.2.1
> MFVPSIFLVFIMSCIISASVSYESKSTSGHAISFPTHEHLDVNQAIKEIIQPPETVHDNI
> NNIVDDDDDNSRWKLKLLHRDKLPFSHFTDHPHSFQARMKRDLKRVHTLTNTTTNDNNKV
> IKEEELGFGFGSEVISGMEQGSGEYFVRIGVGSPVRQQYMVIDAGSDIVWVQCQPCTHCY
> HQSDPVFDPSLSASFTGVPCSSSLCNRIDNSGCHAGRCKYQVMYGDGSYTKGTMALETLT
> FGRTVIRDVAIGCGHSNHGMFIGAAGGAFSYCLVSRGTNTGSTGSLEFGREVLPAGAAWV
> PLIRNPRAPSFYYIGMLGLGVGGVRVPIPEDAFRLTEEGDGGVVMDTGTAVTRLPHEAYV
> AFRDAFVAQTSSLPRAPAMSIFDTCYDLNGFVTVRVPTISFFLMGGPILTLPARNFLIPV
> DTKGTFCFAFAPSPSRLSIIGNIQQEGIQISIDGANGFVGFGPNIC*
> >tom|Solyc00g005020.1.1
> MYVICKCICIDILIYMLLKVVEEKPQKDKKRRASDRGVLAQSHENVTNTEMAQERNVNER
> LSRGRGITQHSQTSSEANCSGGVLGRGKRPAEHEDTSEGQTRPFKWPRMVGVGIYQAEDG
> .....
>
>
> Many thanks,
>
> giuseppe
>
>
>
> On Fri, Apr 17, 2015 at 6:28 PM, Ole Tange <[email protected]> wrote:
>
>> On Wed, Apr 15, 2015 at 3:34 PM, Giuseppe Aprea
>> <[email protected]> wrote:
>>
>> > I am trying to use GNU Parallel v. 20150122 with BLAST for a very
>> > large sequence alignment. I am using Parallel on a cluster which uses
>> > LSF as the queue system.
>>
>> I have never run anything on a LSF system, so take my advice with 1
>> mmol of NaCl.
>>
>> > "servers" is this file:
>> >
>> > /afs/enea.it/software/bin/blaunch.sh cresco3x013.portici.enea.it
>> > /afs/enea.it/software/bin/blaunch.sh cresco3x013.portici.enea.it
>> :
>>
>> Duplicate lines in a --slf file are merged. It does no harm to have
>> the duplicate lines, but the duplicates are simply merged into 1.
>>
>> > My problems are
>> :
>> > - the result files are empty and I can see the following messages:
>>
>> It has been a while since I used blastp. Does it append to the file
>> given in '-out'? If not, then you are overwriting it every 24 sequences.
>> Maybe you meant {#} instead?
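>>
>> For example (just a sketch): with '-out resultFile_{#}', GNU Parallel
>> replaces {#} with the job sequence number, so each block writes to its
>> own file (resultFile_1, resultFile_2, ...).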
>>
>> > sh -c 'dd bs=1 count=1 of=/tmp/pariINik.chr 2>/dev/null'; test ! -s
>> > "/tmp/pariINik.chr" && rm -f "/tmp/pariINik.chr" && exec true; (cat
>> > /tmp/pariINik.chr; rm /tmp/pariINik.chr; cat - ) |
>> > (/afs/enea.it/software/bin/blaunch.sh cresco3x018.portici.enea.it exec
>> perl
>> > -e
>> >
>> \\\$ENV\\\{\\\"PARALLEL_PID\\\"\\\}=\\\"30669\\\"\\\;\\\$ENV\\\{\\\"PARALLEL_SEQ\\\"\\\}=\\\"687\\\"\\\;\\\$bashfunc\\\
>> > =\\\ \\\"\\\"\\\;@ARGV=\\\"blastp\\\ -evalue\\\ 1e-05\\\ -outfmt\\\ 6\\\
>> > -db\\\
>> >
>> /gporq1_1M/usr/aprea/bio/solanum_melongena/analysis/orthomcl_00/goodProteins_first_0010000\\\
>> > -query\\\ -\\\ -out\\\
>> >
>> /gporq1_1M/usr/aprea/bio/solanum_melongena/analysis/orthomcl_00/resultd_24\\\"\\\;\\\$SIG\\\{CHLD\\\}=sub\\\{\\\$done=1\\\;\\\}\\\;\\\$pid=fork\\\;unless\\\(\\\$pid\\\)\\\{setpgrp\\\;exec\\\$ENV\\\{SHELL\\\},\\\"-c\\\",\\\(\\\$bashfunc.\\\"@ARGV\\\"\\\)\\\;die\\\"exec:\\\$\\\!\\\\n\\\"\\\;\\\}do\\\{\\\$s=\\\$s\\\<1\\\?0.001+\\\$s\\\*1.03:\\\$s\\\;select\\\(undef,undef,undef,\\\$s\\\)\\\;\\\}until\\\(\\\$done\\\|\\\|getppid==1\\\)\\\;kill\\\(SIGHUP,-\\\$\\\{pid\\\}\\\)unless\\\$done\\\;wait\\\;exit\\\(\\\$\\\?\\\&127\\\?128+\\\(\\\$\\\?\\\&127\\\):1+\\\$\\\?\\\>\\\>8\\\););
>>
>> -vv is really only useful for debugging: it is extremely hard to read,
>> even if you are the author of GNU Parallel.
>>
>> I would highly recommend using '-v' first and only resorting to '-vv' if
>> '-v' shows what is expected.
>>
>>
>> /Ole
>>
>
>