Re: [galaxy-dev] disk space and file formats

Edward Kirton Thu, 01 Sep 2011 15:03:21 -0700

Read QC intermediate files account for most of the storage used on our
Galaxy site, and it's a real problem that I must solve soon.


My first attempt at taming the beast was to try to create a single read QC
tool that did such things as convert qual encoding, qual-end trimming, etc.
(very basic functions).  Such a tool could simply be a wrapper around your
favorite existing tools, but doesn't keep the intermediate files.  The added
benefit is that it runs faster because it only has to queue onto the cluster
once.
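
As a minimal sketch of that idea (an illustration only, not the actual tool;
all function names and the default thresholds here are hypothetical), the
steps can be chained as Python generators so that reads stream through
encoding conversion and quality-end trimming in one pass, with nothing
written to disk except the final output:

```python
import io

def read_fastq(handle):
    """Yield (header, seq, qual) from a 4-line-per-record fastq stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual

def to_sanger(records, offset_in=64):
    """Re-encode quality strings from offset_in (assumed 64) to offset 33."""
    shift = offset_in - 33
    for header, seq, qual in records:
        yield header, seq, "".join(chr(ord(c) - shift) for c in qual)

def trim_qual_end(records, min_q=20):
    """Trim bases below min_q from the 3' end of Sanger-encoded reads."""
    for header, seq, qual in records:
        end = len(qual)
        while end > 0 and ord(qual[end - 1]) - 33 < min_q:
            end -= 1
        if end > 0:  # reads trimmed to nothing are dropped
            yield header, seq[:end], qual[:end]

def run_qc(in_handle, out_handle, offset_in=64, min_q=20):
    """Run all steps in one pass; only the final reads are written."""
    records = trim_qual_end(to_sanger(read_fastq(in_handle), offset_in), min_q)
    for header, seq, qual in records:
        out_handle.write(f"{header}\n{seq}\n+\n{qual}\n")
```

Each step only ever holds one read in memory, so adding another step is a
matter of adding another generator to the chain rather than another file.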

Sure, one might argue that it's nice to have all the intermediate files just
in case you wish to review them, but in practice, I have found this happens
relatively infrequently and is too expensive.  If you're a small lab maybe
that's fine, but if you generate a lot of sequence, a more production-line
approach is reasonable.

I've been toying with the idea of replacing all the fastq datatypes with a
single fastq datatype that is sanger-encoded and gzipped.  I think gzipped
reads files are about 1/4 the size of the unpacked version.  Of course, many
tools will require a wrapper if they don't accept gzipped input, but that's
trivial (and many already support compressed reads).
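
To give a rough idea of what such a wrapper might look like (a sketch under
assumptions: the helper names are made up, and the wrapped tool is assumed
to read fastq from stdin), the reads can be decompressed on the fly and
piped into the tool, so the unpacked file never touches the disk:

```python
import gzip
import shutil
import subprocess

def open_reads(path):
    """Open a reads file as text, handling gzip by checking the magic bytes."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "r")

def run_on_gzipped(cmd, gz_path):
    """Run cmd (an argument list for a hypothetical tool that reads fastq
    on stdin), feeding it the decompressed contents of gz_path."""
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
    with gzip.open(gz_path, "rb") as reads:
        shutil.copyfileobj(reads, proc.stdin)  # streamed in chunks
    proc.stdin.close()
    return proc.wait()
```

Tools that insist on a file name rather than stdin would need a named pipe
or a temporary file instead, which this sketch does not cover.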

However, the import tool automatically uncompresses uploaded files, so I'd
need to do some hacking there to prevent this.

Heck, what we really need is a nice compact binary format for reads, perhaps
one which doesn't even store ids (although pairing would need to be recorded).
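
Purely as an illustration of how compact the sequence part could get (this
is not an existing format; it assumes reads contain only A, C, G and T, and
it ignores qualities, Ns and pairing entirely), bases can be packed four to
a byte:

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack_seq(seq):
    """Pack an ACGT-only sequence into bytes, 2 bits per base."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)

def unpack_seq(data, length):
    """Reverse pack_seq; the read length must be stored separately."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])
```

A real format would also have to deal with quality values, which are the
harder part to shrink, and with ambiguous bases.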

Thoughts?

On Fri, Aug 19, 2011 at 11:43 AM, Jelle Scholtalbers <
j.scholtalb...@gmail.com> wrote:

> Hi Patrick,
>
> the issue you are having is partly related to the idea of Galaxy to
> ensure reproducible science and saving each intermediate step and
> output files. For example in your current workflow in Galaxy you can
> easily do something else with each intermediate file - feed it to a
> different tool just to check what the average read length is after
> filtering - you can do that even 2 months after your run.
> If you, however, insist on keeping disk usage low and don't want to
> start programming - as your proposed solutions will require - and
> aren't too afraid of the command line, you might want to start there.
>
> The thing is, a lot of tools accept either an input file or an input
> stream. These same tools also have the ability to either write to an
> output file or to an output stream. This way you can "pipe" these
> tools together.
> e.g. "trimMyFq -i rawinput.fq | removebarcode -i - -n optionN |
> filterJunk -i - -o finalOutput.fq"
>
> I don't know which programs you actually use, but the principle is
> probably the same (as long as the tools actually accept streams).
> This example saves you disk space because, of the 3 tools run, only
> one actually writes to the disk. On the downside, this also means you
> don't have an output file from removebarcode which you can look at to
> see if everything went ok.
>
> If you do want to program or someone else wants to do it, I could
> think of a tool that combines your iterative steps and can be run as
> one tool - you could even wrap up your 'pipeline' in a script and put
> that as a tool in your Galaxy instance and/or in the toolshed.
>
> Cheers,
> Jelle
>
>
>
> On Fri, Aug 19, 2011 at 6:29 PM, Patrick Page-McCaw
> <ppagemc...@gmail.com> wrote:
> > I'm not a bioinformaticist or programmer so apologies if this is a
> > silly question. I've been occasionally running galaxy on my laptop and
> > on the public server and I love it. The issue that I have is that my
> > workflow requires many steps (what I do is probably very unusual). Each
> > step creates a new large fastq file as the sequences are iteratively
> > trimmed of junk. This fills my laptop and fills the public server with
> > lots of unnecessary very large files.
> >
> > I've been thinking about the structure of the files and my workflow and
> > it seems to me that a more space efficient system would be to have a
> > single file (or a sql database) on which each tool can work. Most of
> > what I do is remove adapter sequences, extract barcodes, trim by
> > quality, map to the genome and then process my hits by type (exon,
> > intron etc). Since the clean up tools in FASTX aren't written with my
> > problem in mind, it takes several passes to get the sequences trimmed
> > up before mapping.
> >
> > If I had a file that had a format something like (here as tab delimited):
> > Header  Seq     Phred   Start   Len     Barcode etc
> > Each tool could read the Seq and Phred starting at Start and running
> > Len nucleotides and work on that. The tool could then write a new Start
> > and Len to reflect the trimming it has done[1]. For convenience let me
> > call this an HSPh format.
> >
> > So it would be a real pain, no doubt, to rewrite all the tools. From
> > the little that I can read of the tools, it seems that the way the
> > input is handled internally varies quite a bit. But it seems to me
> > (naively?) that it would be relatively easy to write a conversion tool
> > that would take the HSPh format and turn it into fastq or fasta on the
> > fly for the tools. Since most tools take fastq or fasta, it should be a
> > write once, use many times, plugin. The harder (and slower) part would
> > be mapping the fastq output back onto HSPh format. But again, this
> > should be a write once, use for many tools plugin. Both of the
> > intermediate files would be deleted when done. Just as a real quick
> > test I thought I would see how long it takes to run sed on a 1.35GB
> > fastq file and it was so fast on my laptop, < 2 minutes, that it was
> > done before I noticed.
> >
> > Then as people are interested, the tools could be converted to take as
> > input the new format.
> >
> > It may well be true that, in these days of $100 terabyte drives, this
> > is not useful, and that cycles are limiting, not drive space. But I
> > think if the tools were rewritten to take and write to an HSPh format,
> > processing would be faster too. It seems like some effort has been made
> > to create the tab delimited format and maybe someone is already working
> > on something like this (no doubt better designed).
> >
> > I may have a comp sci undergrad working in the lab this fall. With help
> > we (well, he) might manage some parts of this. He is apparently quite a
> > talented and hard working C++ programmer. Is it worthwhile?
> >
> > thanks
> >
> > [1] It could even do something like:
> > Header Seq Phred Start Len Tool Parameter Start Len Tool Parameter Start Len etc
> > Tool is the tool name, Parameter a list of parameters used, Start and
> > Len would be the latest trim positions. And the last Start Len pair
> > would be the one to use by default for the next tool, but this would
> > keep an edit history without doubling the space needs with each
> > processing cycle. I wouldn't need this but it might be more friendly
> > for users, an "undo" means removing 4 columns. A format like this would
> > probably be better as a sql database.
> > ___________________________________________________________
> > Please keep all replies on the list by using "reply all"
> > in your mail client.  To manage your subscriptions to this
> > and other Galaxy lists, please use the interface at:
> >
> >  http://lists.bx.psu.edu/
> >
>
