I'm not a bioinformaticist or programmer, so apologies if this is a silly 
question. I've been occasionally running Galaxy on my laptop and on the public 
server and I love it. The issue I have is that my workflow requires many 
steps (what I do is probably very unusual). Each step creates a new large fastq 
file as the sequences are iteratively trimmed of junk. This fills my laptop and 
the public server with lots of unnecessary, very large files.

I've been thinking about the structure of the files and my workflow, and it 
seems to me that a more space-efficient system would be to have a single file 
(or a SQL database) on which each tool can work. Most of what I do is remove 
adapter sequences, extract barcodes, trim by quality, map to the genome and 
then process my hits by type (exon, intron etc). Since the cleanup tools in 
FASTX aren't written with my problem in mind, it takes several passes to get 
the sequences trimmed up before mapping. 

If I had a file that had a format something like (here as tab delimited):
Header  Seq     Phred   Start   Len     Barcode etc
Each tool could read the Seq and Phred starting at Start and running Len 
nucleotides and work on that. The tool could then write a new Start and Len to 
reflect the trimming it has done[1]. For convenience let me call this an HSPh 
format.
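
To make the idea concrete, here is a minimal sketch in Python of one HSPh 
record and a tool that only rewrites Start and Len. The names HSPhRecord and 
trim_3prime_by_quality are made up for illustration, and it assumes Start is 
0-based and the Phred string uses the Phred+33 encoding; neither is settled 
anywhere above.

```python
from dataclasses import dataclass


@dataclass
class HSPhRecord:
    """One line of the proposed tab-delimited HSPh format (hypothetical)."""
    header: str
    seq: str
    phred: str
    start: int      # assumption: 0-based offset of the first kept base
    length: int     # number of bases kept, counted from start
    barcode: str = ""

    @classmethod
    def from_line(cls, line):
        header, seq, phred, start, length, *rest = line.rstrip("\n").split("\t")
        return cls(header, seq, phred, int(start), int(length),
                   rest[0] if rest else "")

    def to_line(self):
        return "\t".join([self.header, self.seq, self.phred,
                          str(self.start), str(self.length), self.barcode])

    def active(self):
        """Return the (seq, phred) window that the next tool should see."""
        end = self.start + self.length
        return self.seq[self.start:end], self.phred[self.start:end]


def trim_3prime_by_quality(rec, min_q=20, offset=33):
    """Shorten Len until the last kept base has quality >= min_q.

    Seq and Phred are never modified; only the window changes.
    """
    _, qual = rec.active()
    keep = len(qual)
    while keep > 0 and ord(qual[keep - 1]) - offset < min_q:
        keep -= 1
    rec.length = keep
    return rec
```

The full Seq and Phred strings stay in the file untouched, so each trimming 
pass costs two small integers per read instead of a whole new fastq file.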

It would be a real pain, no doubt, to rewrite all the tools. From the little 
of the tools' code that I can read, the way the input is handled internally 
varies quite a bit. But it seems to me (naively?) that it would be relatively 
easy to write a conversion tool that would take the HSPh format and turn it 
into fastq or fasta on the fly for the tools. Since most tools take fastq or 
fasta, it should be a write once, use many times plugin. The harder (and 
slower) part would be mapping the fastq output back onto the HSPh format. But 
again, this should be a write once, use for many tools plugin. Both of the 
intermediate files would be deleted when done. Just as a real quick test, I 
thought I would see how long it takes to run sed on a 1.35GB fastq file, and 
it was so fast on my laptop (< 2 minutes) that it was done before I noticed. 
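
The two converters could look roughly like the Python sketch below. The 
function names are hypothetical, and the mapping back assumes the wrapped tool 
keeps the read header and only trims, so that its output sequence is a 
contiguous piece of the window it was given. A tool that edits bases would 
need a different approach.

```python
def hsph_to_fastq(line):
    """Render the active window of one HSPh line as a 4-line fastq record."""
    header, seq, phred, start, length = line.rstrip("\n").split("\t")[:5]
    s, n = int(start), int(length)
    return f"@{header}\n{seq[s:s + n]}\n+\n{phred[s:s + n]}\n"


def update_from_fastq(line, trimmed_seq):
    """Write a new Start and Len into an HSPh line from a tool's output.

    Assumption: trimmed_seq is a contiguous substring of the window the tool
    was given. The first match is used, which can be ambiguous for repetitive
    reads.
    """
    fields = line.rstrip("\n").split("\t")
    seq, s, n = fields[1], int(fields[3]), int(fields[4])
    pos = seq[s:s + n].find(trimmed_seq)
    if pos < 0:
        raise ValueError(f"output for {fields[0]} is not a substring of its input")
    fields[3] = str(s + pos)
    fields[4] = str(len(trimmed_seq))
    return "\t".join(fields)
```

For example, if a tool is handed AAACGTACGTTT and returns CGTACG, the record 
gets Start 3 and Len 6, and the next call to hsph_to_fastq emits only those 
six bases with their qualities.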

Then as people are interested, the tools could be converted to take as input 
the new format.

It may well be that in these days of $100 terabyte drives this is not useful, 
and that cycles are limiting, not drive space. But I think if the tools were 
rewritten to read and write the HSPh format, processing would be faster too. 
It seems like some effort has already been made to create the tab delimited 
format, and maybe someone is already working on something like this (no doubt 
better designed).

I may have a comp sci undergrad working in the lab this fall. With help we 
(well, he) might manage some parts of this. He is apparently quite a talented 
and hard-working C++ programmer. Is it worthwhile? 

thanks

[1] It could even do something like:
Header Seq Phred Start Len Tool Parameter Start Len Tool Parameter Start Len etc
Tool is the tool name, Parameter is a list of the parameters used, and Start 
and Len are the trim positions after that tool has run. The last Start Len 
pair would be the one used by default for the next tool, so this would keep an 
edit history without doubling the space needs with each processing cycle. I 
wouldn't need this, but it might be more friendly for users: an "undo" just 
means removing the last 4 columns. A format like this would probably be better 
as a SQL database.
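
As a rough illustration of the database variant, here is a sketch using 
Python's built-in sqlite3 module. The table and column names (reads, trims, 
step and so on) and the helper functions are invented for this example; they 
are only one possible layout.

```python
import sqlite3

SCHEMA = """
CREATE TABLE reads (
    header TEXT PRIMARY KEY,
    seq    TEXT NOT NULL,
    phred  TEXT NOT NULL
);
CREATE TABLE trims (
    header    TEXT    NOT NULL REFERENCES reads(header),
    step      INTEGER NOT NULL,   -- 1, 2, 3... in the order tools were run
    tool      TEXT    NOT NULL,
    parameter TEXT    NOT NULL,
    start     INTEGER NOT NULL,
    len       INTEGER NOT NULL,
    PRIMARY KEY (header, step)
);
"""


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_trim(conn, header, tool, parameter, start, length):
    """Append one Tool/Parameter/Start/Len entry to a read's history."""
    (step,) = conn.execute(
        "SELECT COALESCE(MAX(step), 0) + 1 FROM trims WHERE header = ?",
        (header,)).fetchone()
    conn.execute("INSERT INTO trims VALUES (?, ?, ?, ?, ?, ?)",
                 (header, step, tool, parameter, start, length))


def current_window(conn, header):
    """Latest (start, len) for a read, or the whole read if never trimmed."""
    row = conn.execute(
        "SELECT start, len FROM trims WHERE header = ? "
        "ORDER BY step DESC LIMIT 1", (header,)).fetchone()
    if row is not None:
        return row
    (seq,) = conn.execute(
        "SELECT seq FROM reads WHERE header = ?", (header,)).fetchone()
    return (0, len(seq))


def undo(conn, header):
    """Drop the most recent trim step, like removing the last 4 columns."""
    conn.execute(
        "DELETE FROM trims WHERE header = ? AND step = "
        "(SELECT MAX(step) FROM trims WHERE header = ?)", (header, header))
```

Each trim is one small row, and the sequence and quality strings are stored 
once per read no matter how many tools have been run.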