And if you choose to write a Perl script, remember --shebang-wrap. See, for example, http://www.biostars.org/p/63816/ (EXAMPLE (advanced): Using GNU Parallel to parallelize your own scripts).
/Ole

On Wed, May 8, 2013 at 3:53 PM, Matt Oates (Home) <[email protected]> wrote:
> Dear Carlos,
>
> If it was me doing this I would use a full Perl script instead of some
> oneliner with parallel. The issues with quoting and shell escapes are
> more problematic with parallel so I would just bypass them completely.
> Additionally I would parse the file using Bio::SeqIO
> (http://doc.bioperl.org/bioperl-live/Bio/SeqIO/fastq.html) then write
> out the edited ID versions using SeqIO too. Then you can just make a
> general perl script to work with GNU parallel that edits IDs of any
> file format by just wrapping SeqIO with some command line arguments. I
> have several scripts like this that loosely wrap SeqIO in a GNU
> parallel friendly way, they end up being very handy and are worth the
> investment of doing it properly and in a more permanent way.
>
> Best,
> Matt.
>
> ---
> http://blog.mattoates.co.uk
> http://www.mattoates.co.uk
>
>
> On 8 May 2013 14:45, Carlos Pérez Cantalapiedra
> <[email protected]> wrote:
>> thank you Matt.
>>
>> About escape character I have another related question. Why do I need to
>> escape the dollars here?
>>
>> cat sbcc073_pcm_ill_all.musket_default.fastq | head -8 | parallel --pipe
>> "perl -lne 'if($.%4==1){s/^(@.*)_([12]).*/\$1\/\$2/;print}' "
>>
>> Regarding the starting "@", this is one of the things I don't like of
>> working with FastQ format. When doing an AWK script or perl/python one I can
>> manage that easy, just by number_of_line%4==1 for example. However, I don't
>> know how to control that from sed expression. Of course, a more reliable way
>> to avoid matching qual strings is adding some more characters from
>> identifier (eg: @HUWSI...) but this makes the scripts less general... any
>> other advice would be very welcome.
>>
>> best,
>> Carlos
>>
>>
>> 2013/5/8 Matt Oates (Home) <[email protected]>
>>
>>> Dear Carlos,
>>>
>>> You just need to quote the sed command so:
>>>
>>> cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 | parallel
>>> --pipe sed 's#^\(@.*\)_\([12]\).*#\1/\2#'
>>>
>>> becomes:
>>>
>>> cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 | parallel
>>> --pipe "sed 's#^\(@.*\)_\([12]\).*#\1/\2#'"
>>>
>>>
>>> You might also want to define how the FASTQ records are separated
>>> which is problematic if you have reads from anything other than
>>> Illumina 1.5+ since the quality score can include @ symbols at the
>>> start of a line. You could do something like the following to split
>>> the pipe so that whole FASTQ records go to each job:
>>>
>>> parallel --recstart='^@' --pipe "sed 's#^\(@.*\)_\([12]\).*#\1/\2#'"
>>>
>>> something like the following might be more appropriate though:
>>>
>>> parallel -N 4 --pipe "sed 's#^\(@.*\)_\([12]\).*#\1/\2#'"
>>>
>>> This will tell parallel to take 4 lines at a time per job, just
>>> increase to multiples of four to send that number of FASTQ records to
>>> each job. Obviously with your current sed it doesn't actually matter
>>> that you have one FASTQ record per job but it might be important in
>>> the future.
>>>
>>> Best Wishes,
>>> Matt.
>>>
>>> ---
>>> http://blog.mattoates.co.uk
>>> http://www.mattoates.co.uk
>>>
>>>
>>> On 8 May 2013 11:01, Carlos Pérez Cantalapiedra
>>> <[email protected]> wrote:
>>> > Hello everyone,
>>> >
>>> > I am new to this list and to the parallel command. I hope answer to next
>>> > question is not too obvious, but enough to get some advice :)
>>> >
>>> > I have to process a big file, and have been reading about parallel
>>> > command
>>> > to try to use more than 1 core processor when using sed, sort and so on.
>>> > So
>>> > I first wanted to change first line of every four (because of naming
>>> > conventions of this kind of file - FastQ format).
>>> >
>>> > For example, this would be a group of four, and I want to modify the
>>> > first
>>> > line:
>>> >
>>> > cat sbcc073_pcm_ill_all.musket_default.fastq | head -4
>>> >
>>> > @HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289_1:N:0:ACTTGA
>>> > GCGAGAGAAT
>>> > +
>>> > GHHHHHHHHHH
>>> >
>>> > With the next command I have the work done:
>>> >
>>> > cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 | sed
>>> > 's#^\(@.*\)_\([12]\).*#\1/\2#'
>>> >
>>> > @HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289/1
>>> > GCGAGAGAAT
>>> > +
>>> > GHHHHHHHHHH
>>> >
>>> > However, when using parallel it seems that is not recognizing the group
>>> > capture brackets:
>>> >
>>> > cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 | parallel
>>> > --pipe
>>> > sed 's#^\(@.*\)_\([12]\).*#\1/\2#'
>>> >
>>> > @HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289_1:N:0:ACTTGA
>>> > GCGAGAGAAT
>>> > +
>>> > GHHHHHHHHHH
>>> >
>>> > When removing backslashes or using sed -r the command is telling me:
>>> >
>>> > /bin/bash: -c: line 3: syntax error near unexpected token `('
>>> > /bin/bash: -c: line 3: ` (cat /tmp/60xrxvCIRX.chr; rm
>>> > /tmp/60xrxvCIRX.chr; cat - ) | (sed s#^(@.*)_([12]).*#\1/\2# );'
>>> >
>>> > Could anyone put some light on this?
>>> >
>>> > thank you very much
>>
>>
>
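
For anyone reading this thread later, the quoting behaviour discussed above can be reproduced without parallel installed. The sketch below is only an approximation: run_via_shell is a hypothetical helper (not part of GNU Parallel) that joins its arguments with spaces and hands the string to a second shell, which is roughly the double evaluation the error message in the original question suggests. The last lines show, under the same assumption of a plain POSIX shell, why the dollars in the perl one-liner need escaping inside double quotes.

```shell
#!/bin/sh
# Hypothetical stand-in for the double evaluation: join all arguments
# with spaces and let a second shell parse the resulting string.
run_via_shell() {
    sh -c "$*"
}

record='@HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289_1:N:0:ACTTGA'

# Single quotes only: the calling shell removes them, so the second shell
# sees bare backslashes and strips those too. sed ends up with literal
# parentheses and no group references, so the substitution never matches.
unquoted=$(printf '%s\n' "$record" |
    run_via_shell sed 's#^\(@.*\)_\([12]\).*#\1/\2#')

# Double quotes around the whole command: the single quotes survive the
# first shell and protect the sed script in the second one.
quoted=$(printf '%s\n' "$record" |
    run_via_shell "sed 's#^\(@.*\)_\([12]\).*#\1/\2#'")

echo "single-quoted only: $unquoted"
echo "double-quoted:      $quoted"

# Dollars inside double quotes are expanded by the calling shell unless
# they are escaped. Give this script a positional parameter to show it.
set -- "oops"
dq_plain="s/a/$1/"     # the calling shell substitutes its own $1
dq_escaped="s/a/\$1/"  # the backslash keeps a literal $1 for perl

echo "unescaped dollar:   $dq_plain"
echo "escaped dollar:     $dq_escaped"
```

The first call leaves the header line untouched, as in the original report, while the second rewrites it to end in /1. Real parallel does more than this helper (it also handles replacement strings and temporary files), so treat the sketch as an illustration of the quoting layers only.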
