On Tue, Aug 25, 2009 at 03:46:54PM -0700, Christopher Lee wrote:
-> On Aug 25, 2009, at 3:21 PM, C. Titus Brown wrote:
->
-> > -> I don't know if this is the same idea, but it would be great to
-> > mirror
-> > -> some portion of gff files natively in pygr-- gene entries have mRNA
-> > -> children which have exon children. That's a natural graph
-> > -> relationship which would be great to have available in pygr. Is
-> > there
-> > -> code already to do this kind of thing? If not, what would be the
-> > best
-> > -> approach? I can think of a few...
-> > ->
-> > -> * Read the gff file and store its rows as 'Bags' as Titus did in
-> > his
-> > -> gff parser-- have myGene.exons map to the 'exon' entries that are
-> > -> children of myGene. Let pygr pickle this as a dictionary of genes.
-> > -> This would mean having a very large dictionary in-memory whenever
-> > you
-> > -> wanted to use the annotations.
-> > -> * Create an sqlite or mysql db from the gff file and make that
-> > -> resource available to pygr.
-> > ->
-> > -> Is there a better way to do this?
-> >
-> > I think the latter approach is the right one, and I'm hoping that
-> > one of
-> > the tutorials-in-progress will address how to do this. If not I'll
-> > lead
-> > a counter-charge; this is something I still don't understand how to do
-> > in pygr, and I need to do it!
->
-> I think Titus is referring to me when he says "tutorials-in-
-> progress"... I will be happy to include this, but it would help me a
-> lot to get started on a GFF example if someone could send me their
-> example code for parsing GFF annotations (e.g. exons) from a file (or
-> a database or whatever is a typical place people are already storing
-> their GFF annotations in). I'd prefer to be able to focus my Pygr
-> tutorial effort on illustrating how to use Pygr classes, rather than
-> researching the GFF formats (which I haven't used yet).
Hey Chris,
I can adapt whatever you write to cover some form of GFF; every GFF
format is different in how it deals with exons, AFAICT. (OK, a slight
exaggeration ;)
How about using this as an example:
http://hgdownload.cse.ucsc.edu/goldenPath/galGal3/database/refGene.txt.gz
and adapting the attached code? It's not my cleanest code ever (to say
the least) but it should save you from having to muck around with the
file formats.
cheers,
--titus
--
C. Titus Brown, [email protected]
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups
"pygr-dev" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/pygr-dev?hl=en
-~----------~----~----~----~------~----~------~--~---
from pygr import seqdb, cnestedlist, annotation, sqlgraph
import sqlite3
genome = seqdb.SequenceFileDB('/scratch/titus/chick/chick.masked.fa')
db = sqlite3.connect('refGene.sqlite')
c = db.cursor()
c.execute('CREATE TABLE annotations (k INTEGER PRIMARY KEY, name TEXT, seq_id TEXT, start INT, stop INT, orientation INT);')
for n, line in enumerate(open('refGene.txt')):
line = line.strip().split('\t')
chr = line[2]
strand = line[3]
if strand == '+':
strand = +1
else:
strand = -1
start = int(line[4]) - 1
stop = int(line[5])
name = line[12]
if 0:
c.execute('INSERT INTO annotations (seq_id, name, start, stop, orientation) VALUES (?, ?, ?, ?, ?);',
(chr, str(name), start, stop, strand))
print chr, name, start, stop, strand
else:
e_starts = map(int, line[9].split(',')[:-1])
e_stops = map(int, line[10].split(',')[:-1])
for i, (start, stop) in enumerate(zip(e_starts, e_stops)):
print n, i
e_name = name + '.%d' % (i + 1)
c.execute('INSERT INTO annotations (seq_id, name, start, stop, orientation) VALUES (?, ?, ?, ?, ?);',
(chr, str(e_name), start, stop, strand))
print 'committing'
db.commit()
slicedb = sqlgraph.SQLTable('annotations',
serverInfo=sqlgraph.SQLiteServerInfo('/scratch/titus/chick/refGene.sqlite'))
annodb = annotation.AnnotationDB(slicedb, genome, annotationType='ref:',
sliceAttrDict=dict(id='seq_id'))
if 0:
for k in annodb:
try:
print k, repr(annodb[k]), slicedb[k].seq_id
except:
raise
###
print 'adding'
al = cnestedlist.NLMSA('refGene.al', 'w', pairwiseMode=True)
for k in annodb:
al.addAnnotation(annodb[k])
al.build(saveSeqDict=True)
print 'building'