Damion,

Thanks a lot - consistently addressing the topic of 'reproducible science' is a challenge, but absolutely necessary. Björn really got me thinking when he gave the linked talk in Chicago (GCC 2012). Although things were stuck for quite a while, I think that Galaxy is still a (the?) key to it, because its basic structures allow it - it is somehow native to the framework. Since some important and powerful elements have appeared (think of the API, for example), the gap to reaching reproducibility from the reference side as well, using the framework's 'on-board tools' (without bending or dismantling the code too much), has narrowed considerably.


Due to the medical context our instance is in, we really need features like this. In some particular details I would perhaps implement the respective functionality a bit differently (for performance reasons), but for the most part I agree: this sounds attractive!

Marius' remark on data managers (which are brand new, as far as I understood from the GCC talks) sounds reasonable, although I have not worked with them yet.

So, count me in, I'm already a bit excited :).

Cheers,
Sebastian



Dooley, Damion wrote:
Ok, I'll be very happy to see what you've accomplished there.  I will read 
through what you've done when I return from vacation in a week!

A key need is to have whatever data comes in show up as linked data in one's 
history to avoid server overhead; a second objective was to not need to modify 
existing workflows - as long as they could work off data in history that is 
typed appropriately.  So your 'select type' solution sounds intriguing!

And certainly interested in your use of git - I tried using git, using a 1-line 
fasta data format, but git seemed to choke on protein fasta files?  And did it 
run into performance problems with larger files?  That was my experience.  I 
think I read its authors say that its upper limit was 15 GB.  That was the 
motivation for writing a simple key-value master file diff system that seems to 
have the same I/O as git on smaller files, but more reliable for the fasta data 
case, and no problems with larger files - it outputs a new version in the same 
time it takes to read a master file.  It has drawbacks though - incoming data 
to compare master with must be sorted in 1 line fasta format first.
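
For illustration, a minimal sketch of a key-value diff over sorted 1-line 
fasta records could look like the following. It is not the implementation 
described above: the function names (to_one_line_fasta, diff_sorted), the 
use of the full header line as the key, and the in-memory lists are all 
assumptions made to keep the example short. A version meant for large 
files would presumably stream both inputs from disk instead.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence on one line)


def to_one_line_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Collapse multi-line fasta text into (header, sequence) records."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def diff_sorted(master: List[Record], incoming: List[Record]) -> List[Tuple[str, str, str]]:
    """Compare two key-sorted record lists in one merge-style pass.

    Returns (operation, key, sequence) tuples, where operation is
    'add', 'delete' or 'update' relative to the master.
    """
    changes: List[Tuple[str, str, str]] = []
    i = j = 0
    while i < len(master) and j < len(incoming):
        m_key, m_seq = master[i]
        n_key, n_seq = incoming[j]
        if m_key == n_key:
            if m_seq != n_seq:
                changes.append(("update", n_key, n_seq))
            i += 1
            j += 1
        elif m_key < n_key:
            # Key exists only in the master: it was removed upstream.
            changes.append(("delete", m_key, m_seq))
            i += 1
        else:
            # Key exists only in the incoming data: it is new.
            changes.append(("add", n_key, n_seq))
            j += 1
    changes.extend(("delete", k, s) for k, s in master[i:])
    changes.extend(("add", k, s) for k, s in incoming[j:])
    return changes


# Both inputs must be sorted by key before they are compared.
master = sorted(to_one_line_fasta([">b desc", "AC", "GT", ">a", "TT"]))
incoming = sorted(to_one_line_fasta([">a", "TT", ">b desc", "ACGA", ">c", "GG"]))
changes = diff_sorted(master, incoming)
```

Because both lists are sorted, each record is visited once, which fits the 
remark above that a new version can be produced in about the time it takes 
to read the master file.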

Thanks for your input; looking forward to your project writeup...

Damion

Hsiao lab, BC Public Health Microbiology & Reference Laboratory, BC Centre for 
Disease Control
655 West 12th Avenue, Vancouver, British Columbia, V5Z 4R4 Canada
________________________________________
From: Björn Grüning [bjoern.gruen...@gmail.com]
Sent: Saturday, August 23, 2014 12:17 AM
To: Dooley, Damion; galaxy-dev@lists.bx.psu.edu
Cc: Hsiao, William
Subject: Re: [galaxy-dev] Concept for a Galaxy Versioned Fasta Data Retrieval 
Tool

Hi Damion,

the idea sounds fantastic!
Can we go a step further and use a specific datatype that keeps entire
fasta files versioned and the user can choose which version he wants to
use, in any tool? Please have a look at my talk at GCC2012. Maybe you
are interested in the (old) patches. I would be very interested to
restart this old project.

https://wiki.galaxyproject.org/Events/GCC2012/Abstracts#Keeping_Track_of_Life_Science_Data


On 23.08.2014 at 03:24, Dooley, Damion wrote:
We are about to implement a fasta database (file) versioning system as a Galaxy 
tool.  I wanted to get interested people's feedback first before we roll ahead 
with the prototype implementation.  The versioning system aims to:
....
    - Simple makeblastdb or bowtie-build commands, or more specific workflows 
that include dustmasker etc can be implemented.

Does this sound attractive?
I think all of the use cases are covered by the old project mentioned
above. But I did not create a new tool; I created a new 'select
type' that everyone can use in all tools. It was using git underneath (yeah,
I have the entire PDB in git and it is working fine :)) but we can
probably replace git with a database if you like.

To answer your question: Yes, very attractive!

We're hoping such a vision could handle Fasta databases from 12 MB to e.g. 200 GB 
(probably requires makeblastdb in parallel at that scale).

Preliminary work suggests this project is doable via the Galaxy API without 
Galaxy customization - does that sound right?!
Yes, as long as the user has an API key.

Cheers,
Bjoern
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
   http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
   http://galaxyproject.org/search/mailinglists/


--
Sebastian Schaaf, M.Sc. Bioinformatics
Faculty Coordinator NGS Infrastructure
Chair of Biometry and Bioinformatics
Department of Medical Informatics,
 Biometry and Epidemiology (IBE)
University of Munich
Marchioninistr. 15, K U1 (postal)
Marchioninistr. 17, U 006 (office)
D-81377 Munich (Germany)
Tel: +49 89 2180-78178
