>>> i actually think illumina's pipeline produces files in this format
>>> (unaligned-bam) now.
> Oh do they? - that's interesting. Do you have a reference/link?

i caught wind of this at the recent illumina users' conference, but i asked someone in our sequencing team to confirm and he hadn't heard of it. it must be limited to the forthcoming miseq sequencer for the time being, but may make its way to the big sequencers later. apparently illumina is thinking about storage as well. i seem to recall the speaker saying they won't produce srf files anymore, but again, this was a talk about the miseq so it may not apply to the other sequencers.

>>> wrappers which create a temporary fastq file would need to be created
>>> but that's easy enough.
>> My argument against that is the cost of going from BAM -> temp
>> fastq may be prohibitive, e.g. the need to generate very large
>> temp fastq files on the fly as input for various applications may
>> lead one back to just keeping a permanent FASTQ around anyway.
> True - if you can't update the tools you need to take BAM.
> In some cases at least you can pipe the gzipped FASTQ
> into alignment tools which accept FASTQ on stdin, so
> there is no temp file per se.

the tools really do need to support the format; the tmpfile was simply a workaround. some tools already support bam, more currently support fastq.gz. (someone here made the wrong bet years ago and adopted a site-wide fastq.bz2 standard, which only recently changed to fastq.gz.) but if illumina does start producing bam files in the future, then we can expect more tools to support that format. until they do, fastq.gz is probably a safe bet.

of course there is a computational cost to compressing/uncompressing files, but that's probably better than storing unnecessarily huge files. it's a trade-off. similarly, there's a trade-off involved in limiting read qc to one or a few big tools, each of which wraps several tools with many options. users can't play around with read qc that way, but letting them do so may be too expensive (computationally and storage-wise).
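
On the piping workaround quoted above (gzipped FASTQ fed to a tool on stdin, no temp file), here is a minimal sketch of what a wrapper could do. The function name `pipe_fastq_gz` and its arguments are made up for illustration, and `cmd` stands for whatever tool accepts FASTQ on stdin; this is not how any existing Galaxy wrapper works.

```python
import gzip
import shutil
import subprocess


def pipe_fastq_gz(fastq_gz, cmd, out_path):
    """Decompress fastq_gz on the fly into cmd's stdin; cmd's stdout goes to out_path.

    Hypothetical helper: only the compressed input and the tool's output
    ever exist on disk, never an uncompressed temporary FASTQ.
    """
    with gzip.open(fastq_gz, "rb") as reads, open(out_path, "wb") as out:
        # stdout goes straight to a file, so the tool never blocks on a full pipe
        proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=out)
        try:
            # copy decompressed data in chunks; memory use stays small
            shutil.copyfileobj(reads, proc.stdin)
        finally:
            # closing stdin signals end-of-input to the tool
            proc.stdin.close()
        return proc.wait()
```

The cost is that the decompression is paid again on every run, which is the compute-versus-storage trade-off discussed in this thread.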
for the most part, a standard qc will do. one can spend a lot of time and effort squeezing a bit more useful data out of a bad library, for example, when one probably should have just sequenced another library. i favor leaving the playing around to the r&d/development/qc team and just offering a canned/vetted qc solution to the average user.

>> I recall hdf5 was planned as an alternate format (PacBio uses
>> it, IIRC), and of course there is NCBI's .sra format. Anyone
>> using the latter two?
> Moving from the custom BGZF modified gzip format used in
> BAM to HDF5 has been proposed on the samtools mailing list
> (as Chris knows), and there is a proof of principle implementation
> too in BioHDF, http://www.hdfgroup.org/projects/biohdf/
> The SAM/BAM group didn't seem overly enthusiastic though.
> For the NCBI's .sra format, there is no open specification, just
> their public domain source code:
> http://seqanswers.com/forums/showthread.php?t=12054

i believe hdf5 is an indexed data structure, and indexing, as you mentioned, isn't required for unprocessed reads.

since i'm rapidly running out of storage, i think the best immediate solution for me is to deprecate all the fastq datatypes in favor of a new fastqsangergz and to bundle the read qc tools to eliminate intermediate files. sure, users won't be able to play around with their data as much, but my disk is 88% full and my cluster has been 100% occupied for 2 months straight, so less choice is probably better.
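
To illustrate what bundling the read qc steps to eliminate intermediate files could look like, here is a rough sketch that chains two steps as generators over a gzipped, Sanger-encoded (Phred+33) FASTQ and writes only the final gzipped result. The steps themselves (`trim_tail`, `drop_short`), the thresholds, and the function names are hypothetical placeholders, not the actual tools or datatype code.

```python
import gzip


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ text stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def trim_tail(records, min_q=20, offset=33):
    """Hypothetical step: cut trailing bases whose Phred score is below min_q."""
    for header, seq, qual in records:
        keep = len(qual)
        while keep > 0 and ord(qual[keep - 1]) - offset < min_q:
            keep -= 1
        yield header, seq[:keep], qual[:keep]


def drop_short(records, min_len=20):
    """Hypothetical step: discard reads shorter than min_len after trimming."""
    for record in records:
        if len(record[1]) >= min_len:
            yield record


def run_qc(in_path, out_path, min_q=20, min_len=20):
    """Run both steps in one pass; return the number of reads kept."""
    kept = 0
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        pipeline = drop_short(trim_tail(read_fastq(src), min_q), min_len)
        for header, seq, qual in pipeline:
            dst.write(f"{header}\n{seq}\n+\n{qual}\n")
            kept += 1
    return kept
```

Because each step consumes the previous one record by record, nothing but the input and the final output touches the disk; the price is that users cannot inspect or re-run an individual step.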