Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files

Jarrad Hope Fri, 27 Jun 2014 23:51:29 -0700

Thank you all for replying,

It's just as Jason suggests: GenBank, FASTA and EMBL are more or less
the de facto standards. I suspect FASTA will be phased out because (to
my knowledge) it does not support gene annotation. Nevertheless, they
are all text based.


These formats usually insert line breaks at around 80 characters (a
cultural/human-readability relic of whatever terminal output they had
at the time).
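
As a rough illustration of the fixed-width wrapping described above, here
is a minimal Python sketch. The 80-column default is taken from the figure
mentioned above (real files may use other widths), and the header text and
function names are made up for the example.

```python
def wrap_sequence(seq: str, width: int = 80) -> list[str]:
    """Split a sequence string into fixed-width lines."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]


def format_fasta(header: str, seq: str, width: int = 80) -> str:
    """Render one record: a '>' header line, then the wrapped sequence."""
    return "\n".join([">" + header, *wrap_sequence(seq, width)]) + "\n"


# 200 placeholder bases become lines of 80, 80 and 40 characters.
record = format_fasta("example hypothetical record", "ACGT" * 50)
print(record)
```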

I tried to find a penguin genome sequence for you. The best I can find
is the complete penguin mitochondrial DNA which, as you can see, is
fairly small.
http://www.ncbi.nlm.nih.gov/nuccore/558603183?report=fasta
http://www.ncbi.nlm.nih.gov/nuccore/558603183?report=genbank

If you would like to check out the source for a human, please see
ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/
Don't ask me for a Makefile :) In the near future you'll be able to print
sequences of this length; today we're limited to small sequences (such
as bacteria/viruses) at ~30 cents per base pair.

Each chromosome packs quite well: ~80MB packed, ~240MB unpacked.
However, these formats allow you to represent multiple sequences in one file:
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
<- ~850MB packed

Sidenote: humans aren't particularly more complicated than rice
(in terms of genome size)
http://www.plantgdb.org/XGDB/phplib/download.php?GDB=Os

Other animal sequences - http://www.ensembl.org/index.html

Git is already being used very successfully for SBML (Systems
Biology Markup Language), an XML dialect for cell modelling.

I would show an example git repo of some open source cancer treatments
(various oncolytic viruses) I've been working on. Unfortunately it's
not finished yet, but you can imagine something the size of penguin
mitochondrial DNA with essentially just text being deleted (gene
deletions) as commits.
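
To see why a deletion is awkward for line-based diffs when the sequence is
wrapped at a fixed width, the following Python sketch compares a
same-length substitution with a 30-base deletion. The sequence is random
placeholder data, not a real genome, and the helper names are hypothetical.

```python
import difflib
import random


def wrap(seq: str, width: int = 80) -> list[str]:
    """Split a sequence into fixed-width lines."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]


def changed_lines(before: list[str], after: list[str]) -> int:
    """Count the lines a unified diff marks as removed or added."""
    diff = difflib.unified_diff(before, after, lineterm="", n=0)
    return sum(
        1
        for line in diff
        if line.startswith(("+", "-"))
        and not line.startswith(("+++", "---"))  # skip the file headers
    )


rng = random.Random(0)  # fixed seed so the run is repeatable
seq = "".join(rng.choices("ACGT", k=800))  # placeholder data only

# Same-length substitution: swap the base at offset 100 for another one.
new_base = "A" if seq[100] != "A" else "C"
substituted = seq[:100] + new_base + seq[101:]

# Deletion: drop 30 bases starting at offset 100.
deleted = seq[:100] + seq[130:]

sub_changes = changed_lines(wrap(seq), wrap(substituted))
del_changes = changed_lines(wrap(seq), wrap(deleted))
print("lines touched by substitution:", sub_changes)
print("lines touched by deletion:", del_changes)
```

With 80-column wrapping the substitution touches a single line, while the
deletion shifts every line after it, so the diff rewrites the rest of the
record.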

I hope that helps - With the advancement of Synthetic and Systems
Biology, I really see these sequences benefiting from git.


On Sat, Jun 28, 2014 at 3:13 AM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Jun 27, 2014 at 12:55 PM, Jason Pyeron <[email protected]> wrote:
>>
>> The issue will be, if we talk about changes other than same length
>> substitutions (e.g. Down's Syndrome where it has an insertion of code)
>> would require one code per line for the diffs to work nicely.
>
> Not my area of expertise, but depending on what you are interested in
> - like protein encoding etc, I really think you don't need to do
> things character-per-character. You might want to break at interesting
> sequences (TATA box, and/or known long repeating sequences).
>
> So you could basically turn the "one long line" representation into
> multiple lines, by just looking for particular known interesting (or
> known particularly *UN*interesting) patterns, and whenever you see the
> pattern you create a new line, describing the pattern ("TATAAA" or
> "run of 128 U"), and then continue on the next line.
>
> Then you diff those "semantically enriched" streams instead of the raw data.
>
> But it probably depends on what you are looking for and at. Sometimes
> you might be looking at individual base pairs. And sometimes maybe you
> want to look at the codons, and consider codons that transcribe to
> the same amino acid to be the same, and not show up as a difference.
> So I could well imagine that you might want to have multiple different
> ways to generate these diffs. No?
>
>                Linus
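
Linus's two ideas above could be sketched in Python roughly as follows.
This is only an illustration under stated assumptions: the motif list
(TATAAA plus runs of 8 or more identical bases) is arbitrary, the input is
assumed to be uppercase DNA (A, C, G, T only), codons are read from
position 0 in a single frame, and the function names are hypothetical. The
codon table is the standard genetic code.

```python
import re

# Assumed "interesting" patterns: a TATA box and long single-base runs.
MOTIF = re.compile(r"(TATAAA|A{8,}|C{8,}|G{8,}|T{8,})")


def split_on_motifs(seq: str) -> list[str]:
    """Start a new line at every motif; describe long runs by length."""
    lines = []
    for part in MOTIF.split(seq):
        if not part:
            continue  # re.split yields empty strings around adjacent matches
        is_run = len(part) >= 8 and len(set(part)) == 1
        lines.append(f"run of {len(part)} {part[0]}" if is_run else part)
    return lines


# Standard genetic code, laid out in T, C, A, G order for each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_lines(seq: str) -> list[str]:
    """One amino-acid letter per line ('*' = stop), ignoring a ragged tail."""
    usable = len(seq) - len(seq) % 3
    return [CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3)]


enriched = split_on_motifs("GGCCTATAAAGGC" + "T" * 128 + "ACG")
print("\n".join(enriched))

# GCT and GCC both encode alanine, so these two streams are identical.
same = codon_lines("ATGGCTTAA") == codon_lines("ATGGCCTAA")
print("synonymous change hidden:", same)
```

Writing either stream to a file and diffing that, rather than the raw
sequence, gives the kind of "semantically enriched" comparison described
above; which stream is useful depends on what you are looking for.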
