Re: Parallel with sed group capture

Ole Tange Wed, 15 May 2013 12:54:12 -0700

And if you choose to write a Perl script, remember --shebang-wrap. See
for example http://www.biostars.org/p/63816/ (EXAMPLE(advanced): Using
GNU Parallel to parallelize your own scripts).


/Ole

On Wed, May 8, 2013 at 3:53 PM, Matt Oates (Home) <[email protected]> wrote:
> Dear Carlos,
>
> If it was me doing this I would use a full Perl script instead of some
> oneliner with parallel. The issues with quoting and shell escapes are
> more problematic with parallel so I would just bypass them completely.
> Additionally I would parse the file using Bio::SeqIO
> (http://doc.bioperl.org/bioperl-live/Bio/SeqIO/fastq.html) then write
> out the edited ID versions using SeqIO too. Then you can make a
> general Perl script that works with GNU parallel and edits the IDs of
> any file format, just by wrapping SeqIO with some command line
> arguments. I have several scripts like this that loosely wrap SeqIO in
> a GNU parallel friendly way; they end up being very handy and are
> worth the investment of doing it properly and in a more permanent way.
>
> Best,
> Matt.
>
> ---
> http://blog.mattoates.co.uk
> http://www.mattoates.co.uk
>
>
> On 8 May 2013 14:45, Carlos Pérez Cantalapiedra
> <[email protected]> wrote:
>> Thank you, Matt.
>>
>> I have a related question about escape characters. Why do I need to
>> escape the dollar signs here?
>>
>> cat sbcc073_pcm_ill_all.musket_default.fastq | head -8 | parallel --pipe "perl -lne 'if($.%4==1){s/^(@.*)_([12]).*/\$1\/\$2/;print}' "
>>
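A minimal sketch of why the escapes are needed, assuming a POSIX shell: inside double quotes the calling shell expands $1 and $2 itself, before parallel or perl ever sees them, so the back-references have to be written as \$1 and \$2. The positional values OUTER1 and OUTER2 are made up purely to make the expansion visible.

```shell
# Hypothetical positional parameters, set only so the expansion is visible.
set -- OUTER1 OUTER2

# Unescaped: the calling shell substitutes its own $1 and $2.
unescaped="s/^(@.*)_([12]).*/$1\/$2/"

# Escaped: \$ survives as a literal dollar sign for perl to interpret.
escaped="s/^(@.*)_([12]).*/\$1\/\$2/"

printf '%s\n' "$unescaped"
printf '%s\n' "$escaped"
```

In an interactive shell $1 and $2 are normally unset, so the unescaped form quietly reduces the replacement to a bare slash instead of raising an error.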
>> Regarding the leading "@", this is one of the things I don't like about
>> working with the FastQ format. In an AWK, Perl or Python script I can
>> handle it easily, for example with number_of_line%4==1. However, I don't
>> know how to do that from a sed expression. Of course, a more reliable way
>> to avoid matching quality strings is to add a few more characters from
>> the identifier (e.g. @HWUSI...), but this makes the scripts less general.
>> Any other advice would be very welcome.
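One possible way to restrict sed to header lines, offered as a sketch rather than a recommendation: apply the substitution, then use `n` three times so the remaining three lines of the record are printed untouched. This assumes strict four-line records (no wrapped sequence lines). The function names and the sample reads are made up for illustration.

```shell
# Naive version: substitutes on any line that happens to match.
fix_all_lines() {
    sed 's#^\(@.*\)_\([12]\).*#\1/\2#'
}

# Header-only version: after editing line 1 of a record, each `n` prints the
# current line and loads the next one, so lines 2-4 pass through untouched.
fix_header_lines() {
    sed -e 's#^\(@.*\)_\([12]\).*#\1/\2#' -e 'n;n;n'
}

# Made-up sample: the quality line of the first read starts with "@".
sample() {
    printf '%s\n' \
        '@read1_1:N:0:ACTTGA' 'GCGAGAGAAT' '+' '@HHH_1HHHH' \
        '@read2_2:N:0:ACTTGA' 'TTGACCGTAA' '+' 'GHHHHHHHHH'
}

sample | fix_header_lines
```

When this is combined with parallel --pipe, it only works if every job's chunk starts on a header line, so whole records have to be passed to each job.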
>>
>> best,
>> Carlos
>>
>>
>> 2013/5/8 Matt Oates (Home) <[email protected]>
>>
>>> Dear Carlos,
>>>
>>> You just need to quote the sed command so:
>>>
>>> cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 | parallel --pipe sed 's#^\(@.*\)_\([12]\).*#\1/\2#'
>>>
>>> becomes:
>>>
>>> cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 | parallel --pipe "sed 's#^\(@.*\)_\([12]\).*#\1/\2#'"
>>>
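A sketch of why the extra quotes matter, runnable without parallel: the error message in the original post shows the command being run through /bin/bash -c, so it is parsed by a shell a second time. The `reparse` helper below is a hypothetical stand-in for that second parse (arguments joined with spaces and handed to `sh -c`); it is an assumption for illustration, not GNU Parallel's actual code.

```shell
# Hypothetical stand-in for the second shell parse: join the arguments with
# spaces and let another shell interpret the result.
reparse() {
    sh -c "$*"
}

record='@HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289_1:N:0:ACTTGA'

# Quoted once: the single quotes are consumed by the calling shell, so the
# inner shell strips the backslashes and sed receives s#^(@.*)_([12]).*#1/2#,
# which matches nothing in the record.
printf '%s\n' "$record" | reparse sed 's#^\(@.*\)_\([12]\).*#\1/\2#'

# Quoted twice: the inner single quotes survive the first parse and protect
# the expression during the second one.
printf '%s\n' "$record" | reparse "sed 's#^\(@.*\)_\([12]\).*#\1/\2#'"
```

The first call prints the header unchanged, which is the behaviour reported in the original post; the second prints the header ending in 4289/1.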
>>>
>>> You might also want to define how the FASTQ records are separated
>>> which is problematic if you have reads from anything other than
>>> Illumina 1.5+ since the quality score can include @ symbols at the
>>> start of a line. You could do something like the following to split
>>> the pipe so that whole FASTQ records go to each job:
>>>
>>> parallel --recstart='^@' --pipe "sed 's#^\(@.*\)_\([12]\).*#\1/\2#'"
>>>
>>> something like the following might be more appropriate though:
>>>
>>> parallel -N 4 --pipe "sed 's#^\(@.*\)_\([12]\).*#\1/\2#'"
>>>
>>> This will tell parallel to take 4 lines at a time per job, just
>>> increase to multiples of four to send that number of FASTQ records to
>>> each job. Obviously with your current sed it doesn't actually matter
>>> that you have one FASTQ record per job but it might be important in
>>> the future.
>>>
>>> Best Wishes,
>>> Matt.
>>>
>>> ---
>>> http://blog.mattoates.co.uk
>>> http://www.mattoates.co.uk
>>>
>>>
>>> On 8 May 2013 11:01, Carlos Pérez Cantalapiedra
>>> <[email protected]> wrote:
>>> > Hello everyone,
>>> >
>>> > I am new to this list and to the parallel command. I hope the answer to
>>> > the following question is not too obvious, but I would appreciate some
>>> > advice :)
>>> >
>>> > I have to process a big file, and I have been reading about the parallel
>>> > command as a way to use more than one processor core when running sed,
>>> > sort and so on. As a first step, I want to change the first line of every
>>> > four (because of the naming conventions of this kind of file, FastQ
>>> > format).
>>> >
>>> > For example, this is one group of four lines, and I want to modify the
>>> > first line:
>>> >
>>> >     cat sbcc073_pcm_ill_all.musket_default.fastq | head -4
>>> >
>>> >     @HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289_1:N:0:ACTTGA
>>> >     GCGAGAGAAT
>>> >     +
>>> >     GHHHHHHHHHH
>>> >
>>> > The following command does the job:
>>> >
>>> >     cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 | sed 's#^\(@.*\)_\([12]\).*#\1/\2#'
>>> >
>>> >     @HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289/1
>>> >     GCGAGAGAAT
>>> >     +
>>> >     GHHHHHHHHHH
>>> >
>>> > However, when using parallel, the capture groups do not seem to be
>>> > recognized:
>>> >
>>> >     cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 | parallel --pipe sed 's#^\(@.*\)_\([12]\).*#\1/\2#'
>>> >
>>> >     @HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289_1:N:0:ACTTGA
>>> >     GCGAGAGAAT
>>> >     +
>>> >     GHHHHHHHHHH
>>> >
>>> > When I remove the backslashes or use sed -r, I get this error:
>>> >
>>> >     /bin/bash: -c: line 3: syntax error near unexpected token `('
>>> >     /bin/bash: -c: line 3: `             (cat /tmp/60xrxvCIRX.chr; rm /tmp/60xrxvCIRX.chr; cat - ) | (sed s#^(@.*)_([12]).*#\1/\2# );'
>>> >
>>> > Could anyone shed some light on this?
>>> >
>>> > Thank you very much.
>>
>>
>
