I am currently using GNU parallel (in fact, sem) to run samtools and other
next-generation sequencing analyses.
My setup is quite similar to what this blog post describes:
http://zvfak.blogspot.com/2012/02/samtools-in-parallel.html
However, I prefer to use sem in this way:
export PRO="${HOME}/projects/2012-03-09_H3K4me3"
export RESULT="${PRO}/result/ngs.plot/2012-03-15"
export DATA="${PRO}/data/reheader"
mkdir -p "${RESULT}"

INPUTS=("Sample_H" "Sample_G")

# remove duplicate reads from the input samples, at most 4 jobs at a time
for INPUT in "${INPUTS[@]}"; do
    sem -j4 samtools rmdup -s "${DATA}/${INPUT}.bam" \
        "${RESULT}/${INPUT}_rmdup.bam"
done

TREATS=("Sample_D" "Sample_E" "Sample_F")
# same for the treatment samples
for TREAT in "${TREATS[@]}"; do
    sem -j4 samtools rmdup -s "${DATA}/${TREAT}.bam" \
        "${RESULT}/${TREAT}_rmdup.bam"
done
# wait for all jobs started through sem to finish
sem --wait

But I have run into a problem: when the server is under heavy load, the
output of commands run through sem is sometimes truncated at random. No
error is reported in the log files at that step. Only the downstream
processing fails, and I find it hard to trace the problem back because I
have no clear clue about its origin. For example, a correct line of a BED
file should be:
chr1 1000 2000 tag1 256 +
whereas the line I found in the output of a job run by sem was:
chr
I know I should provide clear steps and an example to reproduce the problem,
but it occurs quite randomly. All I know is that it is strongly related to
I/O-intensive operations, especially when I use pipes. Has anybody else
met this problem, or am I the only one?
Are there any tips for detecting the error early and tracing the problem?
I read the man page of sem but found no clear clue about this problem, and
I also searched the parallel archive on mail-archive.com without finding a
clear solution.
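
To illustrate the kind of early check I have in mind, here is a minimal
sketch, assuming the output is a plain 6-column BED file without header or
track lines. The function name check_bed_fields is hypothetical; it is not
part of sem, parallel or samtools. It only detects the symptom (a truncated
line such as "chr") and does not explain its cause.

```shell
# check_bed_fields FILE [NFIELDS]
# Hypothetical helper: print every line of FILE whose number of
# whitespace-separated fields differs from NFIELDS (default 6, an
# assumption matching the BED example above), and return non-zero
# if at least one such line was found.
check_bed_fields() {
    local file="$1"
    local nfields="${2:-6}"
    awk -v n="${nfields}" '
        NF != n {
            # report file name, line number and the offending line
            printf "%s:%d: expected %d fields, got %d: %s\n", FILENAME, FNR, n, NF, $0
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "${file}"
}
```

Such a check could be called right after each step, for example
check_bed_fields "${RESULT}/Sample_H.bed" || exit 1
where the .bed file name is only a placeholder, so that the script stops at
the step that produced the truncated file instead of much later.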
Thank you for your help with my trivial problem.
Best,
Ning-Yi SHAO