On Sep 2, 2011, at 8:02 PM, Peter Cock wrote:

> On Fri, Sep 2, 2011 at 9:27 PM, Fields, Christopher J
> <cjfie...@illinois.edu> wrote:
>> On Sep 2, 2011, at 3:02 PM, Edward Kirton wrote:
>> 
>>>> What, like a BAM file of unaligned reads? Uses gzip compression, and
>>>> tracks the pairing information explicitly :) Some tools will already take
>>>> this as an input format, but not all.
>>> 
>>> ah, yes, precisely.  i actually think illumina's pipeline produces
>>> files in this format now.
> 
> Oh do they? - that's interesting. Do you have a reference/link?
> 
>>> wrappers which create a temporary fastq file would need to be created
>>> but that's easy enough.
>> 
>> My argument against that is the cost of going from BAM -> temp
>> fastq may be prohibitive, e.g. the need to generate very large
>> temp fastq files on the fly as input for various applications may
>> lead one back to just keeping a permanent FASTQ around anyway.
> 
> True - if you can't update the tools you need to take BAM.
> In some cases at least you can pipe the gzipped FASTQ
> into alignment tools which accept FASTQ on stdin, so
> there is no temp file per se.

Some applications (Velvet for instance) accept gzipped FASTQ, though they may 
turn around and dump the data out uncompressed.
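
For what it's worth, the piping approach Peter describes can be sketched
roughly as below. This is only an illustration using the Python standard
library; the function name stream_fastq_gz is made up here, and the command
is assumed to be any tool that reads FASTQ on stdin.

```python
import gzip
import shutil
import subprocess


def stream_fastq_gz(fastq_gz_path, command, stdout=None):
    """Decompress a gzipped FASTQ and feed it to `command` on stdin.

    `command` is an argument list for a tool assumed to read FASTQ from
    stdin. No temporary FASTQ file is written. Returns the exit code.
    """
    proc = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=stdout)
    try:
        with gzip.open(fastq_gz_path, "rb") as reads:
            # Copy in fixed-size chunks so memory use stays flat.
            shutil.copyfileobj(reads, proc.stdin)
    finally:
        proc.stdin.close()  # signal end of input to the tool
    return proc.wait()
```

Passing an open file as stdout captures the tool's output; leaving it as
None lets the output go wherever the calling process's output goes.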

>>  One could probably get better performance out of a simpler
>> format that removes most of the 'AM' parts of BAM.
> 
> Yes, but that means inventing yet another file format. At least
> gzipped FASTQ is quite straightforward.

Yes.

>> Or is the idea that the file itself is modified, like a database?
> 
> That would be quite a dramatic change from the current
> Galaxy workflow system - I doubt that would be acceptable
> in general.

My thought as well.

>> And how would indexing work (BAM uses binning on the
>> match to the reference seq), or does it matter?
> 
> BAM indexing as done in samtools/picard is only for the aligned
> reads - so no help for a BAM file of unaligned reads. You could
> use a different indexing system (e.g. by read name) and the
> same BAM BGZF block offset system (I've tried this as an
> experiment with Biopython's SQLite indexing of sequence files).
> 
> However, for tasks taking unaligned reads as input, you
> generally just iterate over the reads in the order on disk.

I think, unless there is a demonstrable advantage to using unaligned BAM, 
fastq.gz is the easiest.
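
To illustrate the "just iterate in the order on disk" case, here is a
minimal sketch in Python (standard library only). The name iter_fastq_gz is
hypothetical, and it assumes the simple four-line FASTQ layout with no
line-wrapped sequences.

```python
import gzip
from itertools import islice


def iter_fastq_gz(path):
    """Yield (name, sequence, quality) tuples in on-disk order.

    Assumes four lines per record; wrapped FASTQ is not handled.
    """
    with gzip.open(path, "rt") as handle:
        while True:
            record = list(islice(handle, 4))
            if not record:
                return  # clean end of file
            if (len(record) < 4
                    or not record[0].startswith("@")
                    or not record[2].startswith("+")):
                raise ValueError("malformed or truncated FASTQ record")
            header, seq, _plus, qual = (line.rstrip("\n") for line in record)
            yield header[1:], seq, qual
```

No index is needed for this access pattern, since each record is read once
and in sequence.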

>> I recall hdf5 was planned as an alternate format (PacBio uses
>> it, IIRC), and of course there is NCBI's .sra format.  Anyone
>> using the latter two?
> 
> Moving from the custom BGZF modified gzip format used in
> BAM to HDF5 has been proposed on the samtools mailing list
> (as Chris knows), and there is a proof of principle implementation
> too in BioHDF, http://www.hdfgroup.org/projects/biohdf/
> The SAM/BAM group didn't seem overly enthusiastic though.

Probably not, as it is somewhat of a competitor to SAM/BAM (a bit broader in 
scope, beyond just alignments).  As Peter indicated, I know the BioHDF folks 
(they are here in town); however, my real question was whether anyone is 
using HDF5 or SRA in production.  I haven't seen adoption beyond PacBio, 
but I have seen some things popping up in Galaxy.

> For the NCBI's .sra format, there is no open specification, just
> their public domain source code:
> http://seqanswers.com/forums/showthread.php?t=12054
> 
> Regards,
> 
> Peter

Simply gzipping FASTQ seems to give better compression than a .lite.sra file 
(and I'm not a happy user of their SRA toolset).  And of course there is 
parallel gzip...

chris

