I also think that this is a great idea, and as you described it I think it's
feasible as a stand-alone Galaxy tool.
Eventually you could consider implementing this as a data manager (
On 23 August 2014 03:24, Dooley, Damion <damion.doo...@bccdc.ca> wrote:
> We are about to implement a fasta database (file) versioning system as a
> Galaxy tool. I wanted to get interested people's feedback first before we
> roll ahead with the prototype implementation. The versioning system aims to:
> * Enable reproducible research: To recreate a search result at a certain
> point in time we need versioning so that search and mapping tools can look
> at sequence reference databases corresponding to a particular past date.
> This recall can also explain the difference between what was known in the
> past vs. currently.
> * Reduce hard drive usage. Some databases are too big to keep N copies
> around, e.g. 5 years of 16S, updated monthly, is, say, 670 MB + 668 MB +
> 665 MB + .... But occasionally we want to access past archives fairly quickly.
> * Integrate database versioning into Galaxy without adding a lot of
> A bonus would be to enable the efficient sharing of versioned databases
> between computers/servers.
> The solution we think would work centres around a "Versioned Data
> Retrieval" tool (draft image attached) that would work as follows:
> 1) User selects from a list of databases provided by "Shared Data > Data
> Libraries > Versioned Data".
> - Each database has a master file that keeps its various versions as a
> list of time-stamped insert/delete transactions of key (fasta id) value
> (description & sequence) pairs.
> - Each master file is managed outside of Galaxy via a triggered process
> on regular fasta file imports from data sources like NCBI or other niche
> - We're expecting, due to the nature of fasta archived sequence updates,
> that our master file would only be about 1.1x the latest version in size
> 2) User enters date / version id to retrieve (validated)
> 3) If a cached version of that database exists, it is linked into the user's
> history.
> 4) Otherwise a new version of it is created, placed in cache, and linked
> into history.
> - The cached version itself then shows up as linked data under a Data
> Library > Versioned Data subfolder.
> 5) User can select preconfigured workflow(s) to execute on the selected
> retrieved fasta file to regenerate any database products they need.
> - Workflow output data would also be cached in the same way the fasta
> data is - by linking the Galaxy Data Library to it.
> - Workflow execution will be skipped if end data already exists in cache.
> - Simple makeblastdb or bowtie-build commands, or more specific
> workflows that include dustmasker etc can be implemented.
> Does this sound attractive?
> We're hoping such a vision could handle fasta databases from 12 MB to e.g.
> 200 GB (probably requires makeblastdb in parallel at that scale).
> Preliminary work suggests this project is doable via the Galaxy API
> without Galaxy customization - does that sound right?!
> Feedback really appreciated!
> Damion Dooley
> Hsiao lab, BC Public Health Microbiology & Reference Laboratory, BC Centre
> for Disease Control
> 655 West 12th Avenue, Vancouver, British Columbia, V5Z 4R4 Canada
> Please keep all replies on the list by using "reply all"
> in your mail client. To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
> To search Galaxy mailing lists use the unified search at:
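
To make the master-file idea above concrete, here is a minimal sketch of how a
version could be rebuilt by replaying time-stamped insert/delete transactions
up to a requested date. Everything in it is an assumption for illustration
only: the Transaction layout, the names version_at and to_fasta, and the toy
LOG are hypothetical and not part of the proposed tool or of Galaxy. A real
master file would be streamed from disk rather than held in memory.

```python
from datetime import date
from typing import Dict, Iterable, NamedTuple, Optional, Tuple

Record = Tuple[str, str]  # (description, sequence)


class Transaction(NamedTuple):
    """One entry of a hypothetical master file (layout is an assumption)."""
    when: date                       # date of the import that made the change
    op: str                          # "insert" or "delete"
    fasta_id: str                    # key
    record: Optional[Record] = None  # value; None for deletes


def version_at(log: Iterable[Transaction], cutoff: date) -> Dict[str, Record]:
    """Replay every transaction dated on or before `cutoff`."""
    db: Dict[str, Record] = {}
    # sorted() is stable, so same-day entries keep their order in the log.
    for tx in sorted(log, key=lambda t: t.when):
        if tx.when > cutoff:
            break
        if tx.op == "insert":
            if tx.record is None:
                raise ValueError(f"insert without record: {tx.fasta_id}")
            db[tx.fasta_id] = tx.record
        elif tx.op == "delete":
            db.pop(tx.fasta_id, None)
        else:
            raise ValueError(f"unknown operation: {tx.op}")
    return db


def to_fasta(db: Dict[str, Record]) -> str:
    """Serialize a reconstructed version as fasta text."""
    return "".join(f">{fid} {desc}\n{seq}\n" for fid, (desc, seq) in db.items())


# Toy log: two sequences imported in January; in February one is withdrawn
# and a new one is added.
LOG = [
    Transaction(date(2014, 1, 1), "insert", "seq1", ("strain A", "ACGT")),
    Transaction(date(2014, 1, 1), "insert", "seq2", ("strain B", "GGCC")),
    Transaction(date(2014, 2, 1), "delete", "seq2"),
    Transaction(date(2014, 2, 1), "insert", "seq3", ("strain C", "TTAA")),
]

if __name__ == "__main__":
    print(to_fasta(version_at(LOG, date(2014, 1, 15))))
```

An updated sequence would be logged as a delete followed by an insert on the
same date. Records that never change are stored only once, which is why the
master file can stay close to the size of the latest version.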