Re: Parallel with sed group capture

Carlos Pérez Cantalapiedra Wed, 08 May 2013 06:45:59 -0700

Thank you, Matt.

I have another related question about escape characters. Why do I need to
escape the dollar signs here?


cat sbcc073_pcm_ill_all.musket_default.fastq | head -8 |
  parallel --pipe "perl -lne 'if($.%4==1){s/^(@.*)_([12]).*/\$1\/\$2/;print}' "
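
A likely explanation, offered as a sketch rather than a definitive answer: the
whole perl command sits inside double quotes, and the single quotes nested
inside them do not protect anything, so the shell substitutes $1 and $2 (its
own positional parameters) before parallel or perl ever see the text. The
snippet below shows only that shell behaviour; the values "first" and "second"
and the variable names are made up for illustration and are not part of the
original command.

```shell
# Hypothetical values: give the shell's positional parameters visible
# contents so the substitution is easy to see.
set -- first second

# Inside double quotes the shell replaces $1 and $2 before the command runs.
unescaped="s/^(@.*)_([12]).*/$1\/$2/"

# With the dollars escaped, the text reaches Perl with $1 and $2 intact.
escaped="s/^(@.*)_([12]).*/\$1\/\$2/"

printf '%s\n' "$unescaped" "$escaped"
# prints:
# s/^(@.*)_([12]).*/first\/second/
# s/^(@.*)_([12]).*/$1\/$2/
```

In an interactive shell the positional parameters are normally empty, so the
unescaped form would hand Perl a replacement of just "/".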

Regarding the leading "@", this is one of the things I don't like about
working with the FastQ format. In an AWK, Perl or Python script I can
handle it easily, for example by testing line_number%4==1. However, I
don't know how to do that from a sed expression. Of course, a more
reliable way to avoid matching quality strings is to add a few more
characters from the identifier (e.g. @HWUSI...), but that makes the
scripts less general. Any other advice would be very welcome.
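
One possible way to get the line%4==1 behaviour from sed, assuming GNU sed is
available: its "first~step" address form (a GNU extension, not POSIX) limits a
command to every step-th line starting at line first. The record below is
contrived so that its quality line starts with "@" and contains "_1"; the
variable names are illustrative only.

```shell
# Contrived record: the quality line (line 4) would also match the pattern.
record='@HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289_1:N:0:ACTTGA
GCGAGAGAAT
+
@HHHH_1HHH'

# 1~4 applies the substitution to lines 1, 5, 9, ... only (GNU sed extension).
fixed=$(printf '%s\n' "$record" | sed '1~4s#^\(@.*\)_\([12]\).*#\1/\2#')

# Without the address, the quality line is rewritten as well.
naive=$(printf '%s\n' "$record" | sed 's#^\(@.*\)_\([12]\).*#\1/\2#')

printf '%s\n' "$fixed"
# prints:
# @HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289/1
# GCGAGAGAAT
# +
# @HHHH_1HHH
```

sed counts lines from the start of its own input, so under parallel --pipe
this only stays correct if every block handed to a job begins on a record
boundary and holds whole 4-line records.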

best,
Carlos


2013/5/8 Matt Oates (Home) <[email protected]>

> Dear Carlos,
>
> You just need to quote the sed command so:
>
> cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 |
>   parallel --pipe sed 's#^\(@.*\)_\([12]\).*#\1/\2#'
>
> becomes:
>
> cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 |
>   parallel --pipe "sed 's#^\(@.*\)_\([12]\).*#\1/\2#'"
>
>
> You might also want to define how the FASTQ records are separated
> which is problematic if you have reads from anything other than
> Illumina 1.5+ since the quality score can include @ symbols at the
> start of a line. You could do something like the following to split
> the pipe so that whole FASTQ records go to each job:
>
> parallel --recstart='^@' --pipe "sed 's#^\(@.*\)_\([12]\).*#\1/\2#'"
>
> something like the following might be more appropriate though:
>
> parallel -N 4 --pipe "sed 's#^\(@.*\)_\([12]\).*#\1/\2#'"
>
> This tells parallel to take 4 lines at a time per job. Increase it in
> multiples of four to send more FASTQ records to each job. With your
> current sed it doesn't actually matter whether each job gets whole
> FASTQ records, but it might be important in the future.
>
> Best Wishes,
> Matt.
>
> ---
> http://blog.mattoates.co.uk
> http://www.mattoates.co.uk
>
>
> On 8 May 2013 11:01, Carlos Pérez Cantalapiedra
> <[email protected]> wrote:
> > Hello everyone,
> >
> > I am new to this list and to the parallel command. I hope the answer to
> > the following question is not too obvious, but I would welcome some
> > advice :)
> >
> > I have to process a big file, and have been reading about the parallel
> > command as a way to use more than one processor core when running sed,
> > sort and so on. As a first step, I want to change the first line of every
> > four (because of the naming conventions of this kind of file, the FastQ
> > format).
> >
> > For example, this would be a group of four, and I want to modify the
> > first line:
> >
> >     cat sbcc073_pcm_ill_all.musket_default.fastq | head -4
> >
> >     @HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289_1:N:0:ACTTGA
> >     GCGAGAGAAT
> >     +
> >     GHHHHHHHHHH
> >
> > The following command does the job:
> >
> >     cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 |
> >         sed 's#^\(@.*\)_\([12]\).*#\1/\2#'
> >
> >     @HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289/1
> >     GCGAGAGAAT
> >     +
> >     GHHHHHHHHHH
> >
> > However, when using parallel, the capture group parentheses do not seem
> > to be recognized:
> >
> >     cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 |
> >         parallel --pipe sed 's#^\(@.*\)_\([12]\).*#\1/\2#'
> >
> >     @HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289_1:N:0:ACTTGA
> >     GCGAGAGAAT
> >     +
> >     GHHHHHHHHHH
> >
> > When I remove the backslashes or use sed -r, the command reports:
> >
> >     /bin/bash: -c: line 3: syntax error near unexpected token `('
> >     /bin/bash: -c: line 3: `             (cat /tmp/60xrxvCIRX.chr; rm /tmp/60xrxvCIRX.chr; cat - ) | (sed s#^(@.*)_([12]).*#\1/\2# );'
> >
> > Could anyone shed some light on this?
> >
> > Thank you very much.
>
