Hi,

There have been a few comments about how general we could make the system,
whether for Galaxy use or just as a stand-alone command-line tool.  So,
some notes below on what I could see it taking on.  Given the scale of the
sequencing data problem, I'm sure the Galaxy community has important
feedback on this.

I looked at git-annex, and it appears to me that although it promises to
keep track of and synchronize network-located files, it doesn't do
versioning on them - am I wrong about that?

I also looked at https://code.google.com/p/leveldb/ , another key-value
database, one that relies more heavily on indexes.  Though it is well
tuned for answering key queries, it isn't particularly good at storing and
retrieving entire versions of a database that could be many gigabytes in
size, which is our mission.

It is relatively easy to generalize the simple keydb prototype I wrote so
that it can handle any key-value database - including binary content and
even binary key data, not just text (e.g. fasta sequences).  So a name
change for the tool is a good idea.
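
For what it's worth, here is one way the binary generalization could work
- a sketch only, not settled design: length-prefixed records instead of
delimited lines (files opened in binary mode, keys and values as bytes):

    import struct

    def write_record(out, key, value):
        # Length-prefixed record: 4-byte big-endian key and value
        # lengths, then the raw bytes.  No delimiter characters, so
        # binary keys and values can't collide with the format itself.
        out.write(struct.pack(">II", len(key), len(value)))
        out.write(key)
        out.write(value)

    def read_record(f):
        # Returns (key, value) as bytes, or None at end of file.
        header = f.read(8)
        if len(header) < 8:
            return None
        klen, vlen = struct.unpack(">II", header)
        return f.read(klen), f.read(vlen)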

I want a versioning system that doesn't assume the incoming master file of
key-value pairs is in the same order as it was on a previous import run.
I was afraid that any arbitrary change in the order of content on the
source server could completely destroy the efficiency of a differential
approach.  Git assumes its content is document-like, so if the fasta
entries are rearranged it generates a slew of inserts and deletes and in
fact provides no benefit.  I tested helping git overcome this hurdle by
converting the fasta content to 1-line key/value fasta entries and sorting
them before git processing.  That seemed to work for smaller and larger
nucleotide fasta files (tested from 10 MB to 2 GB) but failed when it came
to processing protein fasta files, possibly because of the fasta data line
length.  That became another concern: git may have been failing because
each line of the input file was many thousands of characters long.
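
For concreteness, here is a minimal sketch of that pre-processing step -
not the exact script I used, just the idea (Python).  It holds all records
in memory, so the multi-gigabyte cases would really need an external sort
(e.g. Unix sort) instead:

    import sys

    def fasta_to_sorted_lines(in_path, out_path):
        # Flatten a fasta file into one-line key<TAB>sequence records,
        # sorted by key, so record order no longer matters to the diff.
        records = {}
        key, seq = None, []
        with open(in_path) as f:
            for line in f:
                line = line.rstrip("\n")
                if line.startswith(">"):
                    if key is not None:
                        records[key] = "".join(seq)
                    key, seq = line[1:], []
                elif key is not None:
                    seq.append(line)
        if key is not None:
            records[key] = "".join(seq)
        with open(out_path, "w") as out:
            for k in sorted(records):
                out.write("%s\t%s\n" % (k, records[k]))

    if __name__ == "__main__":
        fasta_to_sorted_lines(sys.argv[1], sys.argv[2])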

So, having written a "keydb" versioning engine that works and performs as
well as git, I am definitely shying away from git now as unreliable on
certain kinds of data.  The keydb approach is able to generate a version
file at about the same speed that it takes to read the latest version of
the same db, i.e. at about 50 MB/s on a standard hard drive.
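
To make the mechanics concrete, here is a rough sketch of the merge-style
diff I have in mind - not the actual keydb code, just the general idea,
assuming both versions are already sorted one-line key<TAB>value files:

    def diff_sorted_kv(old_path, new_path):
        # Walk two sorted key<TAB>value files in step, yielding
        # ('add'|'delete'|'update', key, value) difference records.
        def parse(line):
            key, _, value = line.rstrip("\n").partition("\t")
            return key, value

        with open(old_path) as old_f, open(new_path) as new_f:
            old_line, new_line = old_f.readline(), new_f.readline()
            while old_line and new_line:
                ok, ov = parse(old_line)
                nk, nv = parse(new_line)
                if ok < nk:
                    yield ("delete", ok, ov)
                    old_line = old_f.readline()
                elif ok > nk:
                    yield ("add", nk, nv)
                    new_line = new_f.readline()
                else:
                    if ov != nv:
                        yield ("update", nk, nv)
                    old_line = old_f.readline()
                    new_line = new_f.readline()
            while old_line:
                yield ("delete",) + parse(old_line)
                old_line = old_f.readline()
            while new_line:
                yield ("add",) + parse(new_line)
                new_line = new_f.readline()

Since each input is read once, sequentially, throughput is bounded by disk
read speed - which is where the 50 MB/s figure above comes from.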

An extension to keydb that enables it to take in just a list of adds,
deletes or updates is desirable, but that can come later (a sketch of the
idea follows).  More efficiency could be had by fine-tuning updates so
that a whole key-value line doesn't have to replace the previous one, but
that's for later too.
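
Roughly, applying such a list could look like the following - a
hypothetical sketch, assuming the diff file is sorted by key and holds
op<TAB>key<TAB>value lines (deletes carrying an empty value field):

    def apply_diff(master_path, diff_path, out_path):
        # Merge a sorted master file with a sorted diff file to write
        # the next version - one sequential pass over each input.
        with open(master_path) as m, open(diff_path) as d, \
             open(out_path, "w") as out:
            m_line = m.readline()
            for diff_line in d:
                op, key, value = diff_line.rstrip("\n").split("\t", 2)
                # copy master records that sort before this diff key
                while m_line and m_line.split("\t", 1)[0] < key:
                    out.write(m_line)
                    m_line = m.readline()
                if op != "add" and m_line and \
                        m_line.split("\t", 1)[0] == key:
                    m_line = m.readline()  # drop the old record
                if op in ("add", "update"):
                    out.write("%s\t%s\n" % (key, value))
            while m_line:
                out.write(m_line)
                m_line = m.readline()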

A generalization note: the keydb approach works wherever the keys form a
sparse array.  There's nothing stopping the keys from representing a 2D or
3D sparse array of data, as long as the coordinates are coded uniquely
into the one key list.
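
For example (a made-up encoding, just to illustrate): fixed-width,
zero-padded coordinates keep the composite keys both unique and sortable:

    def cell_key(x, y, z):
        # Encode 3D sparse-array coordinates as one sortable key;
        # zero-padding makes lexicographic order match numeric order
        # (assuming non-negative coordinates).
        return "%06d:%06d:%06d" % (x, y, z)

    # The cell at (12, 345, 6) becomes the key "000012:000345:000006",
    # stored as an ordinary key-value record alongside all the others.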

For those interested in versioning XML data, there is an interesting
summary of the challenges here:
http://useless-factor.blogspot.ca/2008/01/matching-diffing-and-merging-xml.html
It leaves me thinking that quick versioning of XML data could only be
accomplished if it could somehow be converted into a key-value db, i.e.
with each top-level XML record identified by a unique key.
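
As a rough illustration of that conversion - assuming each top-level
record carries a unique 'id' attribute, which is a made-up detail here:

    import xml.etree.ElementTree as ET

    def xml_to_kv(xml_path):
        # Map each top-level child element to a (key, serialized-record)
        # pair, keyed on a hypothetical unique 'id' attribute.
        root = ET.parse(xml_path).getroot()
        for record in root:
            key = record.get("id")
            if key is not None:
                yield key, ET.tostring(record, encoding="unicode")

Once in that form, the records could be sorted and diffed exactly like the
fasta case above.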

I could see breaking larger keydb databases up into smaller chunks for
data retrieval and fast parallel processing - the usual approach being to
split the sorted key-value db into separate files based on the first
character or two of each record's key.
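
A sketch of that chunking step, assuming the db is already sorted and the
keys start with filesystem-safe characters:

    import os

    def split_by_prefix(kv_path, out_dir, prefix_len=2):
        # Split a sorted key<TAB>value file into chunk files named by
        # the first prefix_len characters of each key.  Because the
        # input is sorted, only one chunk file is open at a time.
        os.makedirs(out_dir, exist_ok=True)
        current_prefix, out = None, None
        with open(kv_path) as f:
            for line in f:
                prefix = line[:prefix_len]
                if prefix != current_prefix:
                    if out:
                        out.close()
                    current_prefix = prefix
                    out = open(os.path.join(out_dir, prefix + ".kv"), "w")
                out.write(line)
        if out:
            out.close()

Each chunk can then be diffed or fetched independently, in parallel.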
  
Does this go along with people's expectations?

Cheers,

Damion 

________________________________________
From: Björn Grüning [bjoern.gruen...@gmail.com]
Sent: Monday, September 01, 2014 12:47 PM
To: Dooley, Damion; Björn Grüning; galaxy-dev@lists.bx.psu.edu
Cc: Hsiao, William
Subject: Re: [galaxy-dev] Concept for a Galaxy Versioned Fasta Data Retrieval 
Tool

On 25.08.2014 at 18:05, Dooley, Damion wrote:
> Ok, I'll be very happy to see what you've accomplished there.  I will read 
> through what you've done when I return from vacation in a week!
>
> A key need is to have whatever data comes in show up as linked data in
> one's history to avoid server overhead; a second objective was to not
> need to modify existing workflows - as long as they could work off data
> in the history that is typed appropriately.  So your 'select type'
> solution sounds intriguing!
>
> And certainly interested in your use of git - I tried using git with a
> 1-line fasta data format, but git seemed to choke on protein fasta
> files.  And did it run into performance problems with larger files?
> That was my experience.  I think I read its authors say that its upper
> limit was 15 GB.

This is probably true for one large file.  I've been storing the entire
PDB in git for a few years now - one entry per file - and it works fine.

Do you know git annex? https://git-annex.branchable.com/

> That was the motivation for writing a simple key-value master file diff
> system that seems to have the same I/O as git on smaller files, but is
> more reliable for the fasta data case and has no problems with larger
> files - it outputs a new version in the same time it takes to read a
> master file.  It has drawbacks though - incoming data to compare the
> master with must be sorted in 1-line fasta format first.

My intention was to create a universal solution for database tracking.
So, if you can, please design your system in such a way that it can store
arbitrary data, not only fasta files.
