Dear Carlos,

If I were doing this I would use a full Perl script instead of a one-liner with parallel. Quoting and shell escapes are more of a problem under parallel, so I would just bypass them completely.

I would also parse the file using Bio::SeqIO (http://doc.bioperl.org/bioperl-live/Bio/SeqIO/fastq.html) and write out the records with edited IDs using SeqIO too. You can then make a general Perl script that works with GNU parallel and edits the IDs of any file format SeqIO supports, just by wrapping SeqIO with some command line arguments.

I have several scripts like this that loosely wrap SeqIO in a GNU parallel friendly way. They end up being very handy and are worth the investment of doing it properly and in a more permanent way.

Best,
Matt.

---
http://blog.mattoates.co.uk
http://www.mattoates.co.uk

On 8 May 2013 14:45, Carlos Pérez Cantalapiedra <[email protected]> wrote:
> thank you Matt.
>
> About escape character I have another related question. Why do I need to
> escape the dollars here?
>
> cat sbcc073_pcm_ill_all.musket_default.fastq | head -8 | parallel --pipe "perl -lne 'if($.%4==1){s/^(@.*)_([12]).*/\$1\/\$2/;print}' "
>
> Regarding the starting "@", this is one of the things I don't like of
> working with FastQ format. When doing an AWK script or perl/python one I can
> manage that easy, just by number_of_line%4==1 for example. However, I don't
> know how to control that from sed expression. Of course, a more reliable way
> to avoid matching qual strings is adding some more characters from
> identifier (eg: @HUWSI...) but this makes the scripts less general... any
> other advice would be very welcome.
>
> best,
> Carlos
>
>
> 2013/5/8 Matt Oates (Home) <[email protected]>
>
>> Dear Carlos,
>>
>> You just need to quote the sed command so:
>>
>> cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 | parallel --pipe sed 's#^\(@.*\)_\([12]\).*#\1/\2#'
>>
>> becomes:
>>
>> cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 | parallel --pipe "sed 's#^\(@.*\)_\([12]\).*#\1/\2#'"
>>
>>
>> You might also want to define how the FASTQ records are separated
>> which is problematic if you have reads from anything other than
>> Illumina 1.5+ since the quality score can include @ symbols at the
>> start of a line. You could do something like the following to split
>> the pipe so that whole FASTQ records go to each job:
>>
>> parallel --recstart='^@' --pipe "sed 's#^\(@.*\)_\([12]\).*#\1/\2#'"
>>
>> something like the following might be more appropriate though:
>>
>> parallel -N 4 --pipe "sed 's#^\(@.*\)_\([12]\).*#\1/\2#'"
>>
>> This will tell parallel to take 4 lines at a time per job, just
>> increase to multiples of four to send that number of FASTQ records to
>> each job. Obviously with your current sed it doesn't actually matter
>> that you have one FASTQ record per job but it might be important in
>> the future.
>>
>> Best Wishes,
>> Matt.
>>
>> ---
>> http://blog.mattoates.co.uk
>> http://www.mattoates.co.uk
>>
>>
>> On 8 May 2013 11:01, Carlos Pérez Cantalapiedra <[email protected]> wrote:
>> > Hello everyone,
>> >
>> > I am new to this list and to the parallel command. I hope answer to next
>> > question is not too obvious, but enough to get some advice :)
>> >
>> > I have to process a big file, and have been reading about parallel command
>> > to try to use more than 1 core processor when using sed, sort and so on. So
>> > I first wanted to change first line of every four (because of naming
>> > conventions of this kind of file - FastQ format).
>> >
>> > For example, this would be a group of four, and I want to modify the first
>> > line:
>> >
>> > cat sbcc073_pcm_ill_all.musket_default.fastq | head -4
>> >
>> > @HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289_1:N:0:ACTTGA
>> > GCGAGAGAAT
>> > +
>> > GHHHHHHHHHH
>> >
>> > With the next command I have the work done:
>> >
>> > cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 | sed 's#^\(@.*\)_\([12]\).*#\1/\2#'
>> >
>> > @HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289/1
>> > GCGAGAGAAT
>> > +
>> > GHHHHHHHHHH
>> >
>> > However, when using parallel it seems that is not recognizing the group
>> > capture brackets:
>> >
>> > cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 | parallel --pipe sed 's#^\(@.*\)_\([12]\).*#\1/\2#'
>> >
>> > @HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289_1:N:0:ACTTGA
>> > GCGAGAGAAT
>> > +
>> > GHHHHHHHHHH
>> >
>> > When removing backslashes or using sed -r the command is telling me:
>> >
>> > /bin/bash: -c: line 3: syntax error near unexpected token `('
>> > /bin/bash: -c: line 3: ` (cat /tmp/60xrxvCIRX.chr; rm /tmp/60xrxvCIRX.chr; cat - ) | (sed s#^(@.*)_([12]).*#\1/\2# );'
>> >
>> > Could anyone put some light on this?
>> >
>> > thank you very much
