Hi all,

On the continuing topic of the nebulous FASTQ format, are there any strong views as to whether a FASTQ file could hold records without a sequence (and therefore no quality scores)? This could make sense as output from an (aggressive) quality filter.
This is a corner case, and applies to other file formats too of course (e.g. FASTA). I mentioned this to Peter Rice (EMBOSS) off list, and he replied:

On Thu, Jul 30, 2009 at 2:56 PM, Peter Rice<[email protected]> wrote:
> EMBOSS rejects zero length sequences - something we put in some years
> ago for misformatted FASTA files that someone ran through a Taverna
> workflow to launch clustalw via EMBOSS's "emma". The user had got his
> carriage control characters mangled so the sequence was appended to the
> FASTA '>' line and appeared as a long description with no sequence.
>
> I can well imagine for filtering paired reads that zero length sequences
> would be useful.
>
> At the point where the test is made we know the sequence format.
> We can therefore define some or all formats as accepting or rejecting
> zero length sequences.
>
> Similarly we can easily extend to define some applications (e.g. emma)
> as requiring a minimum sequence length.
>
> regards,
>
> Peter

Peter Rice is of course correct - in general the meaning and validity of a zero length sequence is context dependent.

I think Peter Rice makes a good point regarding paired end reads. What I assume he was getting at is the situation where, due to quality trimming, one read of a pair might be trimmed to nothing - leaving essentially a singleton read. However, paired end reads are normally stored using a matched pair of FASTQ files, so it could be important to keep the zero length read present, so that the two files can be read in together in sync.

If we do want to allow zero length sequences in FASTQ, would both of the following be valid? Should there be empty sequence and quality lines, or no sequence and quality lines?
"@identifier\n+\n" (two lines, just the @ and + lines) "@identifier\n\n+\n\n" (four lines, including blank seq and qual lines) or with the repeated identifier on the plus lines: "@identifier\n+identifier\n" (two lines, just the @ and + lines) "@identifier\n\n+identifier\n\n" (four lines, including blank lines) As we are recommending no line wrapping on output this means typical FASTQ records would be four lines - so doing the same makes sense here too. Peter C. _______________________________________________ EMBOSS mailing list [email protected] http://lists.open-bio.org/mailman/listinfo/emboss
