Re: Parallel with sed group capture

Matt Oates (Home) Wed, 08 May 2013 06:54:00 -0700

Dear Carlos,

If it were me, I would use a full Perl script instead of a one-liner
with parallel. Quoting and shell escapes are more of a problem under
parallel, because the command passes through an extra shell, so I
would just bypass them completely. Additionally, I would parse the
file using Bio::SeqIO
(http://doc.bioperl.org/bioperl-live/Bio/SeqIO/fastq.html) and write
out the records with edited IDs using SeqIO too. Then you can make a
general Perl script, usable with GNU parallel, that edits the IDs of
any supported file format by wrapping SeqIO with some command-line
arguments. I have several scripts like this that loosely wrap SeqIO in
a GNU parallel-friendly way. They end up being very handy and are
worth the investment of doing it properly and in a more permanent way.
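
To illustrate the "script instead of one-liner" idea, here is a minimal
sketch. It is not the Bio::SeqIO approach described above: it swaps in
plain awk inside a shell function, assumes unwrapped four-line FASTQ
records, and the names fix_fastq_ids.sh and fix_ids are hypothetical.
Because the pattern lives in a file, nothing has to be escaped for the
extra shell that parallel starts.

```shell
#!/bin/sh
# fix_fastq_ids.sh (hypothetical name): filter FASTQ from stdin to stdout,
# rewriting IDs such as "..._1:N:0:ACTTGA" to ".../1" on header lines only.

fix_ids() {
    awk '
        # Header lines are line 1 of each 4-line record (assumes unwrapped FASTQ).
        NR % 4 == 1 && match($0, /^@.*_[12]/) {
            # The greedy match ends at the last "_1" or "_2"; keep the text
            # before the underscore, then add "/" and the read number.
            $0 = substr($0, 1, RLENGTH - 2) "/" substr($0, RLENGTH, 1)
        }
        { print }
    '
}

# Small demonstration: the quality line starts with "@" and contains "_1",
# but it is line 4 of the record, so it is left untouched.
fix_ids <<'EOF'
@HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289_1:N:0:ACTTGA
GCGAGAGAAT
+
@HHHHHH_1H
EOF
```

Saved as an executable script, with the demonstration replaced by a bare
fix_ids call, it could be run as something like
parallel --pipe -N 4 ./fix_fastq_ids.sh < reads.fastq (reads.fastq being
a placeholder). The -N value needs to stay a multiple of four so that
line counting restarts on a record boundary in every job.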


Best,
Matt.

---
http://blog.mattoates.co.uk
http://www.mattoates.co.uk


On 8 May 2013 14:45, Carlos Pérez Cantalapiedra
<[email protected]> wrote:
> Thank you, Matt.
>
> About escape characters, I have another related question. Why do I need to
> escape the dollar signs here?
>
> cat sbcc073_pcm_ill_all.musket_default.fastq | head -8 | parallel --pipe "perl -lne 'if($.%4==1){s/^(@.*)_([12]).*/\$1\/\$2/;print}' "
>
> Regarding the starting "@", this is one of the things I don't like about
> working with the FastQ format. In an AWK, Perl or Python script I can
> handle that easily, for example with number_of_line%4==1. However, I don't
> know how to control that from a sed expression. Of course, a more reliable
> way to avoid matching quality strings is to add some more characters from
> the identifier (e.g. @HWUSI...), but this makes the scripts less general.
> Any other advice would be very welcome.
>
> best,
> Carlos
>
>
> 2013/5/8 Matt Oates (Home) <[email protected]>
>
>> Dear Carlos,
>>
>> You just need to quote the sed command so:
>>
>> cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 | parallel --pipe sed 's#^\(@.*\)_\([12]\).*#\1/\2#'
>>
>> becomes:
>>
>> cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 | parallel --pipe "sed 's#^\(@.*\)_\([12]\).*#\1/\2#'"
>>
>>
>> You might also want to define how the FASTQ records are separated,
>> which is problematic if you have reads from anything other than
>> Illumina 1.5+, since the quality string can include @ symbols at the
>> start of a line. You could do something like the following to split
>> the pipe so that whole FASTQ records go to each job:
>>
>> parallel --recstart '@' --pipe "sed 's#^\(@.*\)_\([12]\).*#\1/\2#'"
>>
>> something like the following might be more appropriate though:
>>
>> parallel -N 4 --pipe "sed 's#^\(@.*\)_\([12]\).*#\1/\2#'"
>>
>> This tells parallel to pass 4 lines at a time to each job; increase
>> it to a multiple of four to send more FASTQ records to each job.
>> Obviously, with your current sed it doesn't actually matter that each
>> job gets whole FASTQ records, but it might be important in the
>> future.
>>
>> Best Wishes,
>> Matt.
>>
>> ---
>> http://blog.mattoates.co.uk
>> http://www.mattoates.co.uk
>>
>>
>> On 8 May 2013 11:01, Carlos Pérez Cantalapiedra
>> <[email protected]> wrote:
>> > Hello everyone,
>> >
>> > I am new to this list and to the parallel command. I hope the answer to
>> > the next question is not too obvious, but enough to get some advice :)
>> >
>> > I have to process a big file, and have been reading about the parallel
>> > command to try to use more than one processor core when using sed, sort
>> > and so on. So I first wanted to change the first line of every four
>> > (because of the naming conventions of this kind of file - FastQ format).
>> >
>> > For example, this would be a group of four, and I want to modify the
>> > first line:
>> >
>> >     cat sbcc073_pcm_ill_all.musket_default.fastq | head -4
>> >
>> >     @HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289_1:N:0:ACTTGA
>> >     GCGAGAGAAT
>> >     +
>> >     GHHHHHHHHHH
>> >
>> > The following command does the job:
>> >
>> >     cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 | sed 's#^\(@.*\)_\([12]\).*#\1/\2#'
>> >
>> >     @HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289/1
>> >     GCGAGAGAAT
>> >     +
>> >     GHHHHHHHHHH
>> >
>> > However, when using parallel it seems that the group capture brackets
>> > are not recognized:
>> >
>> >     cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 | parallel --pipe sed 's#^\(@.*\)_\([12]\).*#\1/\2#'
>> >
>> >     @HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289_1:N:0:ACTTGA
>> >     GCGAGAGAAT
>> >     +
>> >     GHHHHHHHHHH
>> >
>> > When I remove the backslashes or use sed -r, the command tells me:
>> >
>> >     /bin/bash: -c: line 3: syntax error near unexpected token `('
>> >     /bin/bash: -c: line 3: `             (cat /tmp/60xrxvCIRX.chr; rm /tmp/60xrxvCIRX.chr; cat - ) | (sed s#^(@.*)_([12]).*#\1/\2# );'
>> >
>> > Could anyone shed some light on this?
>> >
>> > thank you very much
>
>
