We are about to implement a fasta database (file) versioning system as a Galaxy
tool. I would like to get feedback from interested people before we go ahead
with the prototype implementation. The versioning system aims to:
* Enable reproducible research: To recreate a search result at a certain point
in time we need versioning so that search and mapping tools can look at
sequence reference databases corresponding to a particular past date. This
recall can also explain the difference between what was known in the past vs.
* Reduce hard drive space. Some databases are too big to keep N copies around,
e.g. 5 years of 16S, updated monthly, is, say, 670 MB + 668 MB + 665 MB + ....
But occasionally we want to access past archives fairly quickly.
* Integrate database versioning into Galaxy without adding a lot of complexity.
A bonus would be to enable the efficient sharing of version databases between
The solution we think would work centres around a "Versioned Data Retrieval"
tool (draft image attached) that would work as follows:
1) User selects from a list of databases provided by "Shared Data > Data
Libraries > Versioned Data".
- Each database has a master file that keeps its various versions as a list
of time-stamped insert/delete transactions of key (fasta id) value (description
& sequence) pairs.
- Each master file is managed outside of Galaxy via a process triggered on
regular fasta file imports from data sources like NCBI or other niche sources.
- Given the nature of updates to archived fasta sequences, we expect that
our master file would only be about 1.1x the size of the latest version.
2) User enters date / version id to retrieve (validated)
3) If a cached version of that database exists, it is linked into user's
4) Otherwise a new version of it is created, placed in cache, and linked into
- The cached version itself then shows up as linked data under a Data Library
> Versioned Data subfolder.
5) User can select preconfigured workflow(s) to execute on the selected
retrieved fasta file to regenerate any database products they need.
- Workflow output data would also be cached in the same way the fasta data is,
i.e. by linking the Galaxy Data Library to it.
- Workflow execution will be skipped if end data already exists in cache.
- Simple makeblastdb or bowtie-build commands, or more specific workflows
that include dustmasker etc., can be implemented.
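
To make steps 1 to 4 more concrete, here is a minimal sketch in Python of how
a master file of time-stamped insert/delete transactions could be replayed to
rebuild a past version, with a cache consulted first. Nothing here is
implemented yet: the Transaction fields, the function names, the in-memory
dict standing in for the cache, and the sample records are all assumptions
made up for illustration. In the real tool the cache would be files linked
into a Data Library rather than a dict.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Transaction:
    """One line of a hypothetical master file."""
    when: date             # date the change was imported
    op: str                # "insert" or "delete"
    fasta_id: str          # key
    description: str = ""  # value, part 1 (ignored for deletes)
    sequence: str = ""     # value, part 2 (ignored for deletes)


def resolve_version(log, as_of):
    """Step 2: map a requested date to the latest import date on or before it."""
    dates = [tx.when for tx in log if tx.when <= as_of]
    if not dates:
        raise ValueError(f"no version exists on or before {as_of}")
    return max(dates)


def build_fasta(log, version):
    """Replay the log up to and including `version` and render fasta text."""
    records = {}
    # sorted() is stable, so same-day transactions keep their log order.
    for tx in sorted(log, key=lambda t: t.when):
        if tx.when > version:
            break
        if tx.op == "insert":
            records[tx.fasta_id] = (tx.description, tx.sequence)
        elif tx.op == "delete":
            records.pop(tx.fasta_id, None)
        else:
            raise ValueError(f"unknown operation: {tx.op!r}")
    return "".join(f">{k} {d}\n{s}\n" for k, (d, s) in sorted(records.items()))


def retrieve(log, as_of, cache):
    """Steps 3-4: reuse a cached build if present, otherwise create and cache it."""
    version = resolve_version(log, as_of)
    if version not in cache:
        cache[version] = build_fasta(log, version)
    return version, cache[version]


# Made-up sample data: two imports, one month apart.
log = [
    Transaction(date(2014, 1, 1), "insert", "seq1", "E. coli 16S", "ACGT"),
    Transaction(date(2014, 1, 1), "insert", "seq2", "B. subtilis 16S", "GGCC"),
    Transaction(date(2014, 2, 1), "delete", "seq2"),
    Transaction(date(2014, 2, 1), "insert", "seq3", "S. aureus 16S", "TTAA"),
]
cache = {}
version, fasta = retrieve(log, date(2014, 1, 15), cache)
print(version)
print(fasta, end="")
```

Keying the cache on the resolved import date rather than on the date the user
typed means that any two requests falling between the same pair of imports
share one cached file.
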
Does this sound attractive?
We're hoping such a system could handle fasta databases from 12 MB to e.g.
200 GB (which probably requires running makeblastdb in parallel at that scale).
Preliminary work suggests this project is doable via the Galaxy API without
customizing Galaxy itself. Does that sound right?
Feedback really appreciated!
Hsiao lab, BC Public Health Microbiology & Reference Laboratory, BC Centre for
655 West 12th Avenue, Vancouver, British Columbia, V5Z 4R4 Canada