Hi Damion,

The idea sounds fantastic!
Could we go a step further and use a dedicated datatype that keeps entire fasta files versioned, so that the user can choose which version to use in any tool? Please have a look at my talk at GCC2012. Maybe you are interested in the (old) patches. I would be very interested in restarting this old project.


https://wiki.galaxyproject.org/Events/GCC2012/Abstracts#Keeping_Track_of_Life_Science_Data


On 23.08.2014 at 03:24, Dooley, Damion wrote:
We are about to implement a fasta database (file) versioning system as a Galaxy 
tool.  I wanted to get interested people's feedback first before we roll ahead 
with the prototype implementation.  The versioning system aims to:

* Enable reproducible research: To recreate a search result at a certain point 
in time we need versioning so that search and mapping tools can look at 
sequence reference databases corresponding to a particular past date.  This 
recall can also explain the difference between what was known in the past vs. 
currently.

* Reduce hard drive space.  Some databases are too big to keep N copies around, 
e.g. 5 years of 16S, updated monthly, is, say, 670 MB + 668 MB + 665 MB + ....  But 
occasionally we want to access past archives fairly quickly.

* Integrate database versioning into Galaxy without adding a lot of complexity.

A bonus would be to enable the efficient sharing of version databases between 
computers/servers.

The solution we think would work centres around a "Versioned Data Retrieval" 
tool (draft image attached) that would work as follows:

1) User selects from a list of databases provided by  "Shared Data > Data Libraries 
> Versioned Data".
   - Each database has a master file that keeps its various versions as a list of 
time-stamped insert/delete transactions of key (fasta id) value (description & 
sequence) pairs.
   - Each master file is managed outside of galaxy via a triggered process on 
regular fasta file imports from data sources like NCBI or other niche sources.
   - We're expecting, due to the nature of fasta archived sequence updates, 
that our master file would only be about 1.1x the latest version in size 
(uncompressed).
2) User enters date / version id to retrieve (validated)
3) If a cached version of that database exists, it is linked into user's 
history.
4) Otherwise a new version of it is created, placed in cache, and linked into 
history.
   - The cached version itself then shows up as linked data under a Data Library 
> Versioned Data subfolder.
5) User can select preconfigured workflow(s) to execute on the selected 
retrieved fasta file to regenerate any database products they need.
   - Workflow output data would also be cached in the same way the fasta data 
is - by linking the Galaxy Data Library to it.
   - Workflow execution will be skipped if end data already exists in cache.
   - Simple makeblastdb or bowtie-build commands, or more specific workflows 
that include dustmasker etc can be implemented.
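
A minimal sketch of how steps 1, 3 and 4 above might fit together, assuming the master file has already been parsed into a list of transactions. The Transaction fields, the function names and the plain dict standing in for the cache are hypothetical illustrations, not part of the proposed tool or of Galaxy.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Transaction:
    """One time-stamped entry of the (hypothetical) master file."""
    timestamp: str         # ISO date such as "2014-08-01"; sorts as plain text
    op: str                # "insert" or "delete"
    fasta_id: str          # key
    description: str = ""  # value, part 1 (unused for deletes)
    sequence: str = ""     # value, part 2 (unused for deletes)


def replay(log: Iterable[Transaction], as_of: str) -> Dict[str, Tuple[str, str]]:
    """Rebuild the id -> (description, sequence) mapping as of a date (inclusive)."""
    state: Dict[str, Tuple[str, str]] = {}
    # sorted() is stable, so entries sharing a timestamp keep their log order.
    for tx in sorted(log, key=lambda t: t.timestamp):
        if tx.timestamp > as_of:
            break
        if tx.op == "insert":
            state[tx.fasta_id] = (tx.description, tx.sequence)
        elif tx.op == "delete":
            state.pop(tx.fasta_id, None)
        else:
            raise ValueError(f"unknown operation: {tx.op!r}")
    return state


def to_fasta(state: Dict[str, Tuple[str, str]]) -> str:
    """Serialise a reconstructed version as fasta text, ordered by id."""
    return "".join(
        f">{fasta_id} {desc}\n{seq}\n"
        for fasta_id, (desc, seq) in sorted(state.items())
    )


def retrieve(log: Iterable[Transaction], as_of: str, cache: Dict[str, str]) -> str:
    """Return the cached version if present (step 3), else build and cache it (step 4)."""
    if as_of not in cache:
        cache[as_of] = to_fasta(replay(log, as_of))
    return cache[as_of]
```

With a log that inserts seq1 in June, inserts seq2 in July and deletes seq1 in August, retrieve(log, "2014-07-15", cache) yields both records, while retrieve(log, "2014-08-01", cache) yields only seq2. A real implementation would stream the master file from disk and keep the cache on disk rather than hold everything in memory.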

Does this sound attractive?

I think all of the use cases are covered by the old project mentioned above. However, I did not create a new tool; I created a new 'select type' that everyone can use in all tools. It used git underneath (yeah, I have the entire PDB in git and it is working fine :)), but we can probably replace git with a database if you like.

To answer your question: Yes, very attractive!

We're hoping such a vision could handle Fasta databases from 12 MB to e.g. 200 GB 
(probably requires makeblastdb in parallel at that scale).

Preliminary work suggests this project is doable via the Galaxy API without 
galaxy customization - does that sound right?!

Yes, as long as the User has an API key.

Cheers,
Bjoern

Feedback really appreciated!

Regards,

Damion Dooley

Hsiao lab, BC Public Health Microbiology & Reference Laboratory, BC Centre for 
Disease Control
655 West 12th Avenue, Vancouver, British Columbia, V5Z 4R4 Canada



___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
   http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
   http://galaxyproject.org/search/mailinglists/
