Sorry for the delayed response...

> Sure, why not -- these are all just one-to-many relations, right?  As
> long as the mapping information for each step (gene --> mRNAs;
> mRNA --> exons) exists somewhere, we should be able to tell Pygr how to
> get it and use it.  My first guess is we would model this as a
> one-to-many mapping with no edge information.


> First question: where are these mappings stored?  In a database
> table?  If the schema is something straightforward (e.g. for gene -->
> mRNA, a table with an mRNA unique ID and a gene ID foreign key value),
> it should be easy to plug that in to Pygr using one of its standard
> classes like sqlgraph.SQLGraph.
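
The table layout in the quote can be sketched with Python's built-in sqlite3 (all table, column, and identifier names here are hypothetical, just to make the one-to-many shape concrete):

```python
import sqlite3

# Hypothetical gene --> mRNA table: each mRNA row carries a gene_id foreign key.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE mrna (mrna_id TEXT PRIMARY KEY, gene_id TEXT)')
conn.executemany('INSERT INTO mrna VALUES (?, ?)',
                 [('mRNA1', 'geneA'), ('mRNA2', 'geneA'), ('mRNA3', 'geneB')])

def mrnas_for_gene(gene_id):
    """The one-to-many mapping is just a lookup on the foreign key."""
    rows = conn.execute('SELECT mrna_id FROM mrna WHERE gene_id = ? '
                        'ORDER BY mrna_id', (gene_id,))
    return [r[0] for r in rows]

print(mrnas_for_gene('geneA'))  # ['mRNA1', 'mRNA2']
```

With a schema like that, pointing something like sqlgraph.SQLGraph at the table should be all that's needed.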


All of the info is in gff format for me.  See below.

> If the GFF mappings are stored in flat files (instead of a database),
> I guess we'd have some extra work to provide truly scalable solutions
> for:
> - parsing the file
> - building an on-disk index (using something like shelve or sqlite) so
> we can work with these datasets in Python without always having the
> first step be "load all the data into memory".  Python's support for
> shelve seems to be waning, so sqlite seems like the way to go.  That
> would take us right back to the database case I outlined above, since
> Pygr's sqlgraph classes work transparently with sqlite / MySQL...
>
> If the data are in a flat file, could you provide some example code
> for simply parsing it?  If you just show me how you would load the
> data into memory, I can add code for turning it into an sqlite index
> (a one-time operation), which would then be usable by SQLGraph.
>
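
As a rough sketch of what "simply parsing it" looks like for one GFF3 row (the full parser is attached below; the example line is made up), each record is nine tab-separated columns with ';'-separated key=value attributes:

```python
import csv

FIELDNAMES = ['seqid', 'source', 'type', 'start', 'end', 'score',
              'strand', 'phase', 'attributes']

# One illustrative GFF3 line.
line = 'ChLG2\tBeetleBase\tgene\t100\t200\t.\t+\t.\tID=gene1;Name=foo\n'

reader = csv.DictReader([line], delimiter='\t', fieldnames=FIELDNAMES)
row = next(reader)
row['start'], row['end'] = int(row['start']), int(row['end'])
# The ninth column holds ';'-separated key=value attribute pairs.
attrs = dict(kv.split('=', 1) for kv in row['attributes'].split(';') if kv)

print(row['seqid'], row['start'], attrs['ID'])  # ChLG2 100 gene1
```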

I think that sqlite would be a great choice.  My current method is to build
an annotationDB representing the entire gff file(s), with integer counts as
indices.  For one chromosome, there are 18k rows, and parsing all of the
attributes, etc. takes about 10 seconds, so this may not be the most scalable
option.  To me, though, the initial wait time is well worth having the full
set of gff data available.  The parser I'm using is an adaptation of
Titus's, so it's a csv DictReader.  After reading in the gff file and
creating an annotationDB (saved to disk), I create a pairwise alignment to
the genome and a schema relationship.  With this limited dataset, I really
haven't noticed any lag loading the annotationDB or querying regions.
See attached for example code for parsing and how I'm using it.
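
For the one-time sqlite indexing step you mentioned, the shape of it might be something like this (a sketch, not the attached code; the schema and sample rows are made up):

```python
import sqlite3

# Pretend these dicts came out of the GFF parser attached below.
rows = [
    {'seqid': 'ChLG2', 'type': 'gene', 'start': 100, 'end': 200,
     'attributes': 'ID=gene1'},
    {'seqid': 'ChLG2', 'type': 'mRNA', 'start': 100, 'end': 200,
     'attributes': 'ID=mRNA1;Parent=gene1'},
]

conn = sqlite3.connect(':memory:')  # on disk this would be a real file
conn.execute('CREATE TABLE features '
             '(id INTEGER PRIMARY KEY, seqid TEXT, type TEXT, '
             'start INTEGER, end INTEGER, attributes TEXT)')
conn.executemany(
    'INSERT INTO features (seqid, type, start, end, attributes) '
    'VALUES (:seqid, :type, :start, :end, :attributes)', rows)
# Index on position so region queries never load the whole file.
conn.execute('CREATE INDEX idx_pos ON features (seqid, start, end)')

hits = conn.execute('SELECT type FROM features '
                    'WHERE seqid = ? AND start < ? AND end > ? '
                    'ORDER BY id',
                    ('ChLG2', 300, 50)).fetchall()
print([t for (t,) in hits])  # ['gene', 'mRNA']
```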

Example usage:

python build_genes_from_gff.py -g Bio.Seq.Genome.TRICA.triCas3 -a
Test.Annotation.TRICA.triCas3.officialGenes -p
/home/wbiesing/sandbox/test_triCas3 -m
Test.Annotation.TRICA.triCas3.BeetleBase.officialGenesMap -b officialGenes
/home/wbiesing/sandbox/analysis/tricas3/official_genes/*.gff3



> It would be great to have all of this information available via pygr.
>
> Let's make it happen!


Great!  Thanks a ton!

--
Jake Biesinger


"""GFF parser.

See http://www.sequenceontology.org/gff3.shtml.

Note, for gap format, see

http://may2005.archive.ensembl.org/Docs/wiki/html/EnsemblDocs/CigarFormat.html

read(fp) is the main function provided by this module.
"""

import csv
import urllib

DELIMITER='\t'
FIELDNAMES = ['seqid', 'source', 'type', 'start', 'end', 'score',
              'strand', 'phase', 'attributes' ]

class Bag(dict):
    """dict-like class that supports attribute access as well as getitem.

    >>> x = Bag()
    >>> x['foo'] = 'bar'
    >>> x.foo
    'bar'
    
    """
    def __init__(self, *args, **kw):
        dict.__init__(self, *args, **kw)
        for k in self.keys():
            self.__dict__[k] = self.__getitem__(k)

    def __setitem__(self, k, v):
        dict.__setitem__(self, k, v)
        self.__dict__[k] = v

def read(fp, parse_attributes=True):
    """Parse the given fp as GFF3, yielding Bag objects containing row info.
    """
    
    reader = csv.DictReader(fp, delimiter=DELIMITER, fieldnames=FIELDNAMES)

    for row in reader:
        if row['seqid'].startswith('#'):
            continue

        # unquote all of the other values & store them in a new dict
        row_d = Bag([ (k, urllib.unquote(v)) for (k, v) in row.items() \
                       if k != 'attributes' ])

        # convert int coords
        row_d['start'] = int(row_d['start'])
        row_d['end'] = int(row_d['end'])

        if parse_attributes:
            # pick out the attributes and deal with them specially:
            #  - only unquote after splitting on ;
            #  - make them into a dictionary

            attr = row['attributes']
            attr = attr.split(';')
            attr_d = Bag()
            for kv in attr:
                if not kv:          # tolerate a trailing ';'
                    continue
                k, v = kv.split('=', 1)
                k, v = urllib.unquote(k), urllib.unquote(v)
                attr_d[k] = v

            # save attributes back into it
            row_d['attributes'] = attr_d
        else:
            # we don't want to unquote them if we're not parsing.
            row_d['attributes'] = row['attributes']
        
        # done!
        yield row_d

def read_for_pygr(fp, parse_attributes=True):
    """Parse the given fp as GFF3, yielding Bag objects containing row info.
    Modified to suit pygr notation in orientation and naming convention.

    Note: row_d['seqid'] is assumed to be the chromosome for this feature
    """
    for row_d in read(fp, parse_attributes=parse_attributes):
        row_d['id'] = row_d['seqid']

        if row_d['strand'] == '-':
            # invert coordinates for - strand
            row_d['stop'] = -row_d['start']
            row_d['start'] = -row_d['end'] - 1
            row_d['orientation'] = -1
        else:
            row_d['start'] = row_d['start'] - 1
            row_d['stop'] = row_d['end']
            row_d['orientation'] = 1

        yield row_d


parent_collector_fn = lambda x: x.attributes.Parent
    
def read_groups(fp, collector_fn=None):
    """Read collections, grouping by provided criterion.  Default is Parent.

    'collector_fn' is an optional callable that returns a cookie; this
    cookie is used to decide whether or not to start a new collection.  It
    should return something with reasonable '==' behavior (e.g. a string!)
    and should never return None.

    Note, groupings must be contiguous within file.
    
    Returns lists of Bag objects representing each row.
    """
    if collector_fn is None:
        collector_fn = parent_collector_fn
        
    parent = None
    collect = []
    for row in read(fp):
        if parent != collector_fn(row):
            if collect:
                yield collect
            collect = []
            parent = collector_fn(row)
            
        collect.append(row)

    if collect:
        yield collect

def parse_target(row):
    """Extract Target attr information from the given row."""
    target = row.attributes.Target.split(' ')
    name, start, stop = target[:3]
    ori = '+'
    if len(target) == 4:
        ori = target[3]

    return name, int(start), int(stop), ori

if __name__ == '__main__':
    import doctest
    doctest.testmod()
    

#!/usr/bin/env python
"""Build an annotationDB from the given gff file

"""

import optparse
import sys

from pygr import worldbase
from pygr import annotation
from pygr import cnestedlist
from pygr import metabase

from gffparser import read_for_pygr

def main():
    """Build an annotation from the given gff file
    """
    
    usage = """Build and save the annotations defined in the given gff files
    Saves an annotationDB (representing the file itself) and creates a mapping 
    in the form genome[chromosome][100:200].officialGenes"""
    parser = optparse.OptionParser("%prog [options] data1.gff [data2.gff ...]\n"+usage)
    parser.add_option("--genome_resource", '-g', dest="genome_resource", type="string",
                      help="""The pygr resource for the genome, eg, 'Bio.Seq.Genome.TRICA.triCas3'""")
    parser.add_option("--annotationDB_resource", '-a', dest="annotationDB_resource", type="string",
                      help="""Where to save the created annotationDB. eg, 
                      Bio.Annotation.TRICA.triCas3.officialGenes""")
    parser.add_option("--save_pathstem", '-p', dest="pathstem", type="string", 
                      help="""The file to save the exon resource to, eg,
                    '/home/baldig/projects/genomics/pygrdata/annotations/fly/triCas3_official_genes'""")
    parser.add_option("--map_resource", '-m', dest="map_resource", type="string",
                      help="""the resource to save the annotationDB->Genome map,
                      saved both to worldbase and to worldbase.schema, eg,
                      'Bio.Annotation.TRICA.triCas3.BeetleBase.officialGenesMap'""")
    parser.add_option("--bind_attribute", '-b', dest="bind_attribute", type="string", 
                      help="""The attribute to access annotationDB from genome region, eg, 
                      'officialGenes' would be accessible via triCas3['ChLG2'][100:200].officialGenes 
                      Default is not to bind an attribute to genome""")


    (opts, args) = parser.parse_args()

    if len(args) < 1: 
        parser.print_help()
        print 'Please specify at least one gff file to read'
        sys.exit(-1)
    if None in [opts.genome_resource, opts.annotationDB_resource, opts.pathstem, opts.map_resource]:
        parser.print_help()
        print 'Required options: genome_resource, annotationDB_resource, pathstem, map_resource'
        sys.exit(-1)
    
    print '# Loading original genome db'
    genome = worldbase(opts.genome_resource)
    annotDB = annotation.AnnotationDB(None, genome, opts.bind_attribute, 
                                        filename=opts.pathstem + '_annotDB', mode='c', verbose=False)
    nlmsa = cnestedlist.NLMSA(opts.pathstem, 'w', pairwiseMode=True, bidirectional=False)

    index = 0  # unique ID used in annotationDB
    for filename in args:
        print '# adding to annotationDB from %s' % filename
        fileIn = open(filename)
        for row in read_for_pygr(fileIn):
            curAnnot = annotDB.new_annotation(index, row)
            nlmsa.addAnnotation(curAnnot)
            index += 1
        fileIn.close()
    annotDB.close() # Flush annotation data to disk
    
    print '# building NLMSA from all gff files'
    nlmsa.build(saveSeqDict=True)
    print '# saving annotationDB and NLMSA to worldbase as %s and %s' % (opts.annotationDB_resource,
                                                                        opts.map_resource)
    annotDB.__doc__ = 'Combined gff annotationDB from files %s on genome %s' % (', '.join(args), 
                                                                                opts.genome_resource)
    nlmsa.__doc__ = 'Mapping of %s, from gff files %s onto genome %s' % (opts.annotationDB_resource,
                                                                            ', '.join(args),
                                                                            opts.genome_resource)
    worldbase.add_resource(opts.annotationDB_resource, annotDB)
    worldbase.add_resource(opts.map_resource, nlmsa)

    if opts.bind_attribute:
        print '# saving worldbase schema with bindAttrs=(%s)' % opts.bind_attribute
        genome_annotDB_relation = metabase.ManyToManyRelation(genome, annotDB, bindAttrs=(opts.bind_attribute,))
        genome_annotDB_relation.__doc__ = 'GFF based mapping from %s to genome %s' % (opts.annotationDB_resource,
                                                                                        opts.genome_resource)
        worldbase.add_schema('%s' % opts.map_resource, genome_annotDB_relation)
                                
    
    print '# committing worldbase resources'
    worldbase.commit()

if __name__ == "__main__":
    main()
    
