Hi Patrick,

the issue you are having is partly a consequence of Galaxy's aim of
ensuring reproducible science by saving each intermediate step and
its output files. For example, in your current Galaxy workflow you
can easily do something else with each intermediate file, such as
feeding it to a different tool just to check what the average read
length is after filtering, and you can do that even 2 months after
your run.
If, however, you insist on keeping disk usage low, don't want to
start programming (which your proposed solutions would require), and
aren't too afraid of the command line, you might want to start there.

The thing is, a lot of tools accept either an input file or an input
stream. These same tools can also write either to an output file or
to an output stream. This way you can "pipe" these tools together,
e.g.:

"trimMyFq -i rawinput.fq | removeBarcode -i - -n optionN |
filterJunk -i - -o finalOutput.fq"

I don't know which programs you actually use, but the principle is
probably the same (as long as the tools actually accept streams).
This example saves disk space because, of the 3 tools run, only the
last one actually writes to disk. On the downside, this also means
you don't have an output file from removeBarcode which you can look
at to see if everything went ok.
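
To make this concrete, here is a small sketch using only standard
Unix tools. awk stands in for the made-up trimming and filtering
tools above, the two reads are invented test data, and the file
names are placeholders. The second pipeline uses tee (my addition,
not something your tools need to support) to save a copy of one
intermediate stream in case you do want to inspect it.

```shell
#!/bin/sh
# Sketch only: awk stands in for the hypothetical removeBarcode and
# filterJunk tools; the reads and file names are made up.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# A tiny two-read FASTQ file so the example is self-contained.
printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n@read2\nACGTTT\n+\nIIIIII\n' > rawinput.fq

# Stage 1: drop a 4-base "barcode" from the sequence and quality
# lines (lines 2 and 4 of every 4-line record).
trim='NR % 4 == 2 || NR % 4 == 0 { print substr($0, 5); next } { print }'

# Stage 2: keep only records whose remaining sequence has >= 4 bases.
filter='{ rec[NR % 4] = $0 }
  NR % 4 == 0 && length(rec[2]) >= 4 { print rec[1] "\n" rec[2] "\n" rec[3] "\n" rec[0] }'

# Only the last stage writes to disk; nothing is stored in between.
awk "$trim" rawinput.fq | awk "$filter" > finalOutput.fq

# Same pipeline, but tee also saves the intermediate stream to a file.
awk "$trim" rawinput.fq | tee debarcoded.fq | awk "$filter" > finalOutput_checked.fq
```

The first pipeline leaves only finalOutput.fq behind. The second
costs the disk space of one intermediate file, but lets you check
what the barcode step actually did.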

If you do want to program, or someone else is willing to, I could
imagine a tool that combines your iterative steps and runs as a
single tool. You could even wrap up your 'pipeline' in a script and
add that as a tool to your Galaxy instance and/or the toolshed.
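
A wrapped-up pipeline could look roughly like the sketch below. The
function name clean_fastq, its arguments and its default values are
all made up for illustration, and awk again stands in for whatever
real tools you use. The Galaxy tool definition that would call such
a script is not shown.

```shell
#!/bin/sh
# Sketch only: clean_fastq and its parameters are hypothetical names.
set -eu

# Usage: clean_fastq INPUT.fq OUTPUT.fq [BARCODE_LEN] [MIN_LEN]
clean_fastq() {
  input=$1
  output=$2
  barcode_len=${3:-4}   # bases to strip from the start of each read
  min_len=${4:-20}      # shortest read to keep after trimming

  # Both steps run as one pipeline; only the final result is written.
  awk -v n="$barcode_len" \
      'NR % 4 == 2 || NR % 4 == 0 { print substr($0, n + 1); next } { print }' "$input" |
    awk -v min="$min_len" '{ rec[NR % 4] = $0 }
      NR % 4 == 0 && length(rec[2]) >= min { print rec[1] "\n" rec[2] "\n" rec[3] "\n" rec[0] }' \
      > "$output"
}

# Small demo on made-up data.
demo_dir=$(mktemp -d)
printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n@read2\nACGTTT\n+\nIIIIII\n' > "$demo_dir/raw.fq"
clean_fastq "$demo_dir/raw.fq" "$demo_dir/clean.fq" 4 4
```

In a real script you would replace the demo lines with
clean_fastq "$@" so that the input and output paths come from the
command line, which is how a wrapper would normally hand them over.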


On Fri, Aug 19, 2011 at 6:29 PM, Patrick Page-McCaw
<ppagemc...@gmail.com> wrote:
> I'm not a bioinformaticist or programmer so apologies if this is a silly 
> question. I've been occasionally running galaxy on my laptop and on the 
> public server and I love it. The issue that I have is that my workflow 
> requires many steps (what I do is probably very unusual). Each step creates a 
> new large fastq file as the sequences are iteratively trimmed of junk. This 
> fills my laptop and fills the public server with lots of unnecessary very 
> large files.
> I've been thinking about the structure of the files and my workflow and it 
> seems to me that a more space efficient system would be to have a single file 
> (or a sql database) on which each tool can work. Most of what I do is remove 
> adapter sequences, extract barcodes, trim by quality, map to the genome and 
> then process my hits by type (exon, intron etc). Since the clean up tools in 
> FASTX aren't written with my problem in mind, it takes several passes to get 
> the sequences trimmed up before mapping.
> If I had a file that had a format something like (here as tab delimited):
> Header  Seq     Phred   Start   Len     Barcode etc
> Each tool could read the Seq and Phred starting at Start and running Len 
> nucleotides and work on that. The tool could then write a new Start and Len 
> to reflect the trimming it has done[1]. For convenience let me call this an 
> HSPh format.
> So it would be a real pain, no doubt, to rewrite all the tools. The little 
> that I can read the tools it seems that the way the input is handled 
> internally varies quite a bit. But it seems to me (naively?) that it would be 
> relatively easy to write a conversion tool that would take the HSPh format 
> and turn it into fastq or fasta on the fly for the tools. Since most tools 
> take fastq or fasta, it should be a write once, use many times, plugin. The 
> harder (and slower) part would be mapping the fastq output back onto HSPh 
> format.  But again, this should be a write once, use for many tools plugin. 
> Both of the intermediating files would be deleted when done. Just as a real 
> quick test I thought I would see how long it takes to run sed on a fastq 
> 1.35GB file and it was so fast on my laptop, < 2 minutes, that it was done 
> before I noticed.
> Then as people are interested, the tools could be converted to take as input 
> the new format.
> It may well be true in these days of $100 terabyte drives, this is not 
> useful, that cycles are limiting, not drive space. But I think if the tools 
> were rewritten to take and write to a HSPh format, processing would be faster 
> too. It seems like some effort has been made to create the tab delimited 
> format and maybe someone is already working on something like this (no doubt 
> better designed).
> I may have a comp sci undergrad working in the lab this fall. With help we 
> (well, he) might manage some parts of this. He is apparently quite a talented 
> and hard working C++ programmer. Is it worth while?
> thanks
> [1] It could even do something like:
> Header Seq Phred Start Len Tool Parameter Start Len Tool Parameter Start Len 
> etc
> Tool is the tool name, Parameter a list of parameters used, Start and Len 
> would be the latest trim positions. And the last Start Len pair would be the 
> one to use by default for the next tool, but this would keep an edit history 
> without doubling the space needs with each processing cycle. I wouldn't need 
> this but it might be more friendly for users, an "undo" means removing 4 
> columns. A format like this would probably be better as a sql database.
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>  http://lists.bx.psu.edu/
