[pygr] Re: not reinventing the wheel on Annotations

C. Titus Brown Tue, 25 Aug 2009 15:54:17 -0700

On Tue, Aug 25, 2009 at 03:46:54PM -0700, Christopher Lee wrote:
-> On Aug 25, 2009, at 3:21 PM, C. Titus Brown wrote:
-> 
-> > -> I don't know if this is the same idea, but it would be great to  
-> > mirror
-> > -> some portion of gff files natively in pygr-- gene entries have mRNA
-> > -> children which have exon children.  That's a natural graph
-> > -> relationship which would be great to have available in pygr.  Is  
-> > there
-> > -> code already to do this kind of thing? If not, what would be the  
-> > best
-> > -> approach? I can think of a few...
-> > ->
-> > ->  * Read the gff file and store its rows as 'Bags' as Titus did in  
-> > his
-> > -> gff parser-- have myGene.exons map to the 'exon' entries that are
-> > -> children of myGene.  Let pygr pickle this as a dictionary of genes.
-> > -> This would mean having a very large dictionary in-memory whenever  
-> > you
-> > -> wanted to use the annotations.
-> > ->  * Create an sqlite or mysql db from the gff file and make that
-> > -> resource available to pygr.
-> > ->
-> > -> Is there a better way to do this?
-> >
-> > I think the latter approach is the right one, and I'm hoping that  
-> > one of
-> > the tutorials-in-progress will address how to do this.  If not I'll  
-> > lead
-> > a counter-charge; this is something I still don't understand how to do
-> > in pygr, and I need to do it!
-> 
-> I think Titus is referring to me when he says "tutorials-in- 
-> progress"...  I will be happy to include this, but it would help me a  
-> lot to get started on a GFF example if someone could send me their  
-> example code for parsing GFF annotations (e.g. exons) from a file (or  
-> a database or whatever is a typical place people are already storing  
-> their GFF annotations in).  I'd prefer to be able to focus my Pygr  
-> tutorial effort on illustrating how to use Pygr classes, rather than  
-> researching the GFF formats (which I haven't used yet).


Hey Chris,

I can adapt whatever you write to cover some form of GFF; every GFF
format is different in how it deals with exons, AFAICT.  (OK, a slight
exaggeration ;)

How about using this as an example:

http://hgdownload.cse.ucsc.edu/goldenPath/galGal3/database/refGene.txt.gz

and adapting the attached code?  It's not my cleanest code ever (to say
the least) but it should save you from having to muck around with the
file formats.

cheers,
--titus
-- 
C. Titus Brown, [email protected]

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"pygr-dev" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to 
[email protected]
For more options, visit this group at 
http://groups.google.com/group/pygr-dev?hl=en
-~----------~----~----~----~------~----~------~--~---

from pygr import seqdb, cnestedlist, annotation, sqlgraph
import sqlite3

genome = seqdb.SequenceFileDB('/scratch/titus/chick/chick.masked.fa')

db = sqlite3.connect('refGene.sqlite')
c = db.cursor()

c.execute('CREATE TABLE annotations (k INTEGER PRIMARY KEY, name TEXT, seq_id TEXT, start INT, stop INT, orientation INT);')

for n, line in enumerate(open('refGene.txt')):
    line = line.strip().split('\t')
    chr = line[2]
    strand = line[3]
    if strand == '+':
        strand = +1
    else:
        strand = -1

    start = int(line[4]) - 1
    stop = int(line[5])
    name = line[12]

    if 0:
        c.execute('INSERT INTO annotations (seq_id, name, start, stop, orientation) VALUES (?, ?, ?, ?, ?);',
                  (chr, str(name), start, stop, strand))

        print chr, name, start, stop, strand

    else:
        e_starts = map(int, line[9].split(',')[:-1])
        e_stops = map(int, line[10].split(',')[:-1])

        for i, (start, stop) in enumerate(zip(e_starts, e_stops)):
            print n, i
            e_name = name + '.%d' % (i + 1)
            c.execute('INSERT INTO annotations (seq_id, name, start, stop, orientation) VALUES (?, ?, ?, ?, ?);',
                  (chr, str(e_name), start, stop, strand))

print 'committing'
db.commit()

slicedb = sqlgraph.SQLTable('annotations',
                            serverInfo=sqlgraph.SQLiteServerInfo('/scratch/titus/chick/refGene.sqlite'))

annodb = annotation.AnnotationDB(slicedb, genome, annotationType='ref:',
                                 sliceAttrDict=dict(id='seq_id'))

if 0:
    for k in annodb:
        try:
            print k, repr(annodb[k]), slicedb[k].seq_id
        except:
            raise

###

print 'adding'
al = cnestedlist.NLMSA('refGene.al', 'w', pairwiseMode=True)
for k in annodb:
    al.addAnnotation(annodb[k])
al.build(saveSeqDict=True)
print 'building'

[pygr] Re: not reinventing the wheel on Annotations

Reply via email to