This is an automated email from the git hooks/post-receive script.

tille pushed a commit to branch master
in repository python-pyvcf.
commit de030e81448bd9adaf5d2cc63b8f12565677d2e9 Author: Andreas Tille <[email protected]> Date: Fri Dec 2 20:19:03 2016 +0100 New upstream version 0.6.8 --- PKG-INFO | 187 +------------ README.rst | 33 ++- scripts/vcf_sample_filter.py | 39 +++ setup.cfg | 2 +- setup.py | 40 +-- vcf/__init__.py | 175 +----------- vcf/cparse.pyx | 5 +- vcf/model.py | 146 +++++++++-- vcf/parser.py | 193 ++++++++------ vcf/sample_filter.py | 115 ++++++++ vcf/test/FT.vcf | 50 ++++ vcf/test/issue-214.vcf | 32 +++ vcf/test/strelka.vcf | 57 ++++ vcf/test/test_vcf.py | 613 +++++++++++++++++++++++++++++++++++++++++-- vcf/utils.py | 2 +- 15 files changed, 1168 insertions(+), 521 deletions(-) diff --git a/PKG-INFO b/PKG-INFO index 5ce6750..394a9cc 100644 --- a/PKG-INFO +++ b/PKG-INFO @@ -1,190 +1,27 @@ Metadata-Version: 1.1 Name: PyVCF -Version: 0.6.7 +Version: 0.6.8 Summary: Variant Call Format (VCF) parser for Python Home-page: https://github.com/jamescasbon/PyVCF Author: James Casbon and @jdoughertyii Author-email: [email protected] License: UNKNOWN -Description: A VCFv4.0 and 4.1 parser for Python. - - Online version of PyVCF documentation is available at http://pyvcf.rtfd.org/ - - The intent of this module is to mimic the ``csv`` module in the Python stdlib, - as opposed to more flexible serialization formats like JSON or YAML. ``vcf`` - will attempt to parse the content of each record based on the data types - specified in the meta-information lines -- specifically the ##INFO and - ##FORMAT lines. If these lines are missing or incomplete, it will check - against the reserved types mentioned in the spec. Failing that, it will just - return strings. - - There main interface is the class: ``Reader``. It takes a file-like - object and acts as a reader:: - - >>> import vcf - >>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r')) - >>> for record in vcf_reader: - ... 
print record - Record(CHROM=20, POS=14370, REF=G, ALT=[A]) - Record(CHROM=20, POS=17330, REF=T, ALT=[A]) - Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T]) - Record(CHROM=20, POS=1230237, REF=T, ALT=[None]) - Record(CHROM=20, POS=1234567, REF=GTCT, ALT=[G, GTACT]) - - - This produces a great deal of information, but it is conveniently accessed. - The attributes of a Record are the 8 fixed fields from the VCF spec:: - - * ``Record.CHROM`` - * ``Record.POS`` - * ``Record.ID`` - * ``Record.REF`` - * ``Record.ALT`` - * ``Record.QUAL`` - * ``Record.FILTER`` - * ``Record.INFO`` - - plus attributes to handle genotype information: - - * ``Record.FORMAT`` - * ``Record.samples`` - * ``Record.genotype`` - - ``samples`` and ``genotype``, not being the title of any column, are left lowercase. The format - of the fixed fields is from the spec. Comma-separated lists in the VCF are - converted to lists. In particular, one-entry VCF lists are converted to - one-entry Python lists (see, e.g., ``Record.ALT``). Semicolon-delimited lists - of key=value pairs are converted to Python dictionaries, with flags being given - a ``True`` value. 
Integers and floats are handled exactly as you'd expect:: - - >>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r')) - >>> record = vcf_reader.next() - >>> print record.POS - 14370 - >>> print record.ALT - [A] - >>> print record.INFO['AF'] - [0.5] - - There are a number of convienience methods and properties for each ``Record`` allowing you to - examine properties of interest:: - - >>> print record.num_called, record.call_rate, record.num_unknown - 3 1.0 0 - >>> print record.num_hom_ref, record.num_het, record.num_hom_alt - 1 1 1 - >>> print record.nucl_diversity, record.aaf, record.heterozygosity - 0.6 [0.5] 0.5 - >>> print record.get_hets() - [Call(sample=NA00002, CallData(GT=1|0, GQ=48, DP=8, HQ=[51, 51]))] - >>> print record.is_snp, record.is_indel, record.is_transition, record.is_deletion - True False True False - >>> print record.var_type, record.var_subtype - snp ts - >>> print record.is_monomorphic - False - - ``record.FORMAT`` will be a string specifying the format of the genotype - fields. In case the FORMAT column does not exist, ``record.FORMAT`` is - ``None``. Finally, ``record.samples`` is a list of dictionaries containing the - parsed sample column and ``record.genotype`` is a way of looking up genotypes - by sample name:: - - >>> record = vcf_reader.next() - >>> for sample in record.samples: - ... print sample['GT'] - 0|0 - 0|1 - 0/0 - >>> print record.genotype('NA00001')['GT'] - 0|0 - - The genotypes are represented by ``Call`` objects, which have three attributes: the - corresponding Record ``site``, the sample name in ``sample`` and a dictionary of - call data in ``data``:: - - >>> call = record.genotype('NA00001') - >>> print call.site - Record(CHROM=20, POS=17330, REF=T, ALT=[A]) - >>> print call.sample - NA00001 - >>> print call.data - CallData(GT=0|0, GQ=49, DP=3, HQ=[58, 50]) - - Please note that as of release 0.4.0, attributes known to have single values (such as - ``DP`` and ``GQ`` above) are returned as values. 
Other attributes are returned - as lists (such as ``HQ`` above). - - There are also a number of methods:: - - >>> print call.called, call.gt_type, call.gt_bases, call.phased - True 0 T|T True - - Metadata regarding the VCF file itself can be investigated through the - following attributes: - - * ``Reader.metadata`` - * ``Reader.infos`` - * ``Reader.filters`` - * ``Reader.formats`` - * ``Reader.samples`` - - For example:: - - >>> vcf_reader.metadata['fileDate'] - '20090805' - >>> vcf_reader.samples - ['NA00001', 'NA00002', 'NA00003'] - >>> vcf_reader.filters - OrderedDict([('q10', Filter(id='q10', desc='Quality below 10')), ('s50', Filter(id='s50', desc='Less than 50% of samples have data'))]) - >>> vcf_reader.infos['AA'].desc - 'Ancestral Allele' - - ALT records are actually classes, so that you can interrogate them:: - - >>> reader = vcf.Reader(open('vcf/test/example-4.1-bnd.vcf')) - >>> _ = reader.next(); row = reader.next() - >>> print row - Record(CHROM=1, POS=2, REF=T, ALT=[T[2:3[]) - >>> bnd = row.ALT[0] - >>> print bnd.withinMainAssembly, bnd.orientation, bnd.remoteOrientation, bnd.connectingSequence - True False True T - - Random access is supported for files with tabix indexes. Simply call fetch for the - region you are interested in:: - - >>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz') - >>> for record in vcf_reader.fetch('20', 1110696, 1230237): # doctest: +SKIP - ... print record - Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T]) - Record(CHROM=20, POS=1230237, REF=T, ALT=[None]) - - Or extract a single row:: - - >>> print vcf_reader.fetch('20', 1110696) # doctest: +SKIP - Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T]) - - - The ``Writer`` class provides a way of writing a VCF file. Currently, you must specify a - template ``Reader`` which provides the metadata:: - - >>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz') - >>> vcf_writer = vcf.Writer(open('/dev/null', 'w'), vcf_reader) - >>> for record in vcf_reader: - ... 
vcf_writer.write_record(record) - - - An extensible script is available to filter vcf files in vcf_filter.py. VCF filters - declared by other packages will be available for use in this script. Please - see :doc:`FILTERS` for full description. - - +Description: UNKNOWN Keywords: bioinformatics Platform: UNKNOWN Classifier: Development Status :: 4 - Beta Classifier: Intended Audience :: Developers Classifier: Intended Audience :: Science/Research +Classifier: License :: OSI Approved :: BSD License +Classifier: License :: OSI Approved :: MIT License Classifier: Operating System :: OS Independent +Classifier: Programming Language :: Cython Classifier: Programming Language :: Python +Classifier: Programming Language :: Python :: 2 +Classifier: Programming Language :: Python :: 2.6 +Classifier: Programming Language :: Python :: 2.7 Classifier: Programming Language :: Python :: 3 -Classifier: Topic :: Scientific/Engineering +Classifier: Programming Language :: Python :: 3.2 +Classifier: Programming Language :: Python :: 3.3 +Classifier: Programming Language :: Python :: 3.4 +Classifier: Topic :: Scientific/Engineering :: Bio-Informatics diff --git a/README.rst b/README.rst index a60c0c8..67b5d1b 100644 --- a/README.rst +++ b/README.rst @@ -50,7 +50,7 @@ of key=value pairs are converted to Python dictionaries, with flags being given a ``True`` value. Integers and floats are handled exactly as you'd expect:: >>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r')) - >>> record = vcf_reader.next() + >>> record = next(vcf_reader) >>> print record.POS 14370 >>> print record.ALT @@ -58,7 +58,7 @@ a ``True`` value. 
Integers and floats are handled exactly as you'd expect:: >>> print record.INFO['AF'] [0.5] -There are a number of convienience methods and properties for each ``Record`` allowing you to +There are a number of convenience methods and properties for each ``Record`` allowing you to examine properties of interest:: >>> print record.num_called, record.call_rate, record.num_unknown @@ -82,7 +82,7 @@ fields. In case the FORMAT column does not exist, ``record.FORMAT`` is parsed sample column and ``record.genotype`` is a way of looking up genotypes by sample name:: - >>> record = vcf_reader.next() + >>> record = next(vcf_reader) >>> for sample in record.samples: ... print sample['GT'] 0|0 @@ -135,27 +135,38 @@ For example:: ALT records are actually classes, so that you can interrogate them:: >>> reader = vcf.Reader(open('vcf/test/example-4.1-bnd.vcf')) - >>> _ = reader.next(); row = reader.next() + >>> _ = next(reader); row = next(reader) >>> print row Record(CHROM=1, POS=2, REF=T, ALT=[T[2:3[]) >>> bnd = row.ALT[0] >>> print bnd.withinMainAssembly, bnd.orientation, bnd.remoteOrientation, bnd.connectingSequence True False True T -Random access is supported for files with tabix indexes. Simply call fetch for the -region you are interested in:: +The Reader supports retrieval of records within designated regions for files +with tabix indexes via the fetch method. This requires the pysam module as a +dependency. Pass in a chromosome, and, optionally, start and end coordinates, +for the regions of interest:: >>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz') - >>> for record in vcf_reader.fetch('20', 1110696, 1230237): # doctest: +SKIP + >>> # fetch all records on chromosome 20 from base 1110696 through 1230237 + >>> for record in vcf_reader.fetch('20', 1110695, 1230237): # doctest: +SKIP ... 
print record Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T]) Record(CHROM=20, POS=1230237, REF=T, ALT=[None]) -Or extract a single row:: +Note that the start and end coordinates are in the zero-based, half-open +coordinate system, similar to ``_Record.start`` and ``_Record.end``. The very +first base of a chromosome is index 0, and the the region includes bases up +to, but not including the base at the end coordinate. For example:: - >>> print vcf_reader.fetch('20', 1110696) # doctest: +SKIP - Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T]) + >>> # fetch all records on chromosome 4 from base 11 through 20 + >>> vcf_reader.fetch('4', 10, 20) # doctest: +SKIP +would include all records overlapping a 10 base pair region from the 11th base +of through the 20th base (which is at index 19) of chromosome 4. It would not +include the 21st base (at index 20). (See +http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms for more +information on the zero-based, half-open coordinate system.) The ``Writer`` class provides a way of writing a VCF file. Currently, you must specify a template ``Reader`` which provides the metadata:: @@ -165,8 +176,6 @@ template ``Reader`` which provides the metadata:: >>> for record in vcf_reader: ... vcf_writer.write_record(record) - An extensible script is available to filter vcf files in vcf_filter.py. VCF filters declared by other packages will be available for use in this script. Please see :doc:`FILTERS` for full description. - diff --git a/scripts/vcf_sample_filter.py b/scripts/vcf_sample_filter.py new file mode 100644 index 0000000..d71e6a3 --- /dev/null +++ b/scripts/vcf_sample_filter.py @@ -0,0 +1,39 @@ +#!/usr/bin/env python + +# Author: Lenna X. 
Peterson +# github.com/lennax +# arklenna at gmail dot com + +import argparse +import logging + +from vcf import SampleFilter + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("file", help="VCF file to filter") + parser.add_argument("-o", metavar="outfile", + help="File to write out filtered samples") + parser.add_argument("-f", metavar="filters", + help="Comma-separated list of sample indices or names \ + to filter") + parser.add_argument("-i", "--invert", action="store_true", + help="Keep rather than discard the filtered samples") + parser.add_argument("-q", "--quiet", action="store_true", + help="Less output") + + args = parser.parse_args() + + if args.quiet: + log_level = logging.WARNING + else: + log_level = logging.INFO + logging.basicConfig(format='%(message)s', level=log_level) + + sf = SampleFilter(infile=args.file, outfile=args.o, + filters=args.f, invert=args.invert) + if args.f is None: + print "Samples:" + for idx, val in enumerate(sf.samples): + print "{0}: {1}".format(idx, val) diff --git a/setup.cfg b/setup.cfg index 861a9f5..6bc2ff3 100644 --- a/setup.cfg +++ b/setup.cfg @@ -1,5 +1,5 @@ [egg_info] -tag_build = tag_date = 0 +tag_build = tag_svn_revision = 0 diff --git a/setup.py b/setup.py index 1ea709c..d8089c0 100644 --- a/setup.py +++ b/setup.py @@ -1,6 +1,6 @@ from setuptools import setup -from distutils.core import setup from distutils.extension import Extension +import sys try: from Cython.Distutils import build_ext @@ -8,23 +8,14 @@ try: except: CYTHON = False -requires = [] +IS_PYTHON26 = sys.version_info[:2] == (2, 6) -# python 2.6 does not have argparse -try: - import argparse -except ImportError: - requires.append('argparse') +DEPENDENCIES = ['setuptools'] + +if IS_PYTHON26: + DEPENDENCIES.extend(['argparse', 'counter', 'ordereddict', + 'unittest2']) -import collections -try: - collections.Counter -except AttributeError: - requires.append('counter') -try: - collections.OrderedDict -except 
AttributeError: - requires.append('ordereddict') # get the version without an import VERSION = "Undefined" @@ -47,14 +38,14 @@ if CYTHON: setup( name='PyVCF', packages=['vcf', 'vcf.test'], - scripts=['scripts/vcf_melt', 'scripts/vcf_filter.py'], + scripts=['scripts/vcf_melt', 'scripts/vcf_filter.py', + 'scripts/vcf_sample_filter.py'], author='James Casbon and @jdoughertyii', author_email='[email protected]', description='Variant Call Format (VCF) parser for Python', long_description=DOC, test_suite='vcf.test.test_vcf.suite', - install_requires=['distribute'], - requires=requires, + install_requires=DEPENDENCIES, entry_points = { 'vcf.filters': [ 'site_quality = vcf.filters:SiteQuality', @@ -71,10 +62,19 @@ setup( 'Development Status :: 4 - Beta', 'Intended Audience :: Developers', 'Intended Audience :: Science/Research', + 'License :: OSI Approved :: BSD License', + 'License :: OSI Approved :: MIT License', 'Operating System :: OS Independent', + 'Programming Language :: Cython', 'Programming Language :: Python', + 'Programming Language :: Python :: 2', + 'Programming Language :: Python :: 2.6', + 'Programming Language :: Python :: 2.7', 'Programming Language :: Python :: 3', - 'Topic :: Scientific/Engineering', + 'Programming Language :: Python :: 3.2', + 'Programming Language :: Python :: 3.3', + 'Programming Language :: Python :: 3.4', + 'Topic :: Scientific/Engineering :: Bio-Informatics', ], keywords='bioinformatics', use_2to3=True, diff --git a/vcf/__init__.py b/vcf/__init__.py index 875e2d4..e1aae58 100644 --- a/vcf/__init__.py +++ b/vcf/__init__.py @@ -1,180 +1,15 @@ #!/usr/bin/env python -'''A VCFv4.0 and 4.1 parser for Python. +""" +A VCFv4.0 and 4.1 parser for Python. Online version of PyVCF documentation is available at http://pyvcf.rtfd.org/ +""" -The intent of this module is to mimic the ``csv`` module in the Python stdlib, -as opposed to more flexible serialization formats like JSON or YAML. 
``vcf`` -will attempt to parse the content of each record based on the data types -specified in the meta-information lines -- specifically the ##INFO and -##FORMAT lines. If these lines are missing or incomplete, it will check -against the reserved types mentioned in the spec. Failing that, it will just -return strings. -There main interface is the class: ``Reader``. It takes a file-like -object and acts as a reader:: - - >>> import vcf - >>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r')) - >>> for record in vcf_reader: - ... print record - Record(CHROM=20, POS=14370, REF=G, ALT=[A]) - Record(CHROM=20, POS=17330, REF=T, ALT=[A]) - Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T]) - Record(CHROM=20, POS=1230237, REF=T, ALT=[None]) - Record(CHROM=20, POS=1234567, REF=GTCT, ALT=[G, GTACT]) - - -This produces a great deal of information, but it is conveniently accessed. -The attributes of a Record are the 8 fixed fields from the VCF spec:: - - * ``Record.CHROM`` - * ``Record.POS`` - * ``Record.ID`` - * ``Record.REF`` - * ``Record.ALT`` - * ``Record.QUAL`` - * ``Record.FILTER`` - * ``Record.INFO`` - -plus attributes to handle genotype information: - - * ``Record.FORMAT`` - * ``Record.samples`` - * ``Record.genotype`` - -``samples`` and ``genotype``, not being the title of any column, are left lowercase. The format -of the fixed fields is from the spec. Comma-separated lists in the VCF are -converted to lists. In particular, one-entry VCF lists are converted to -one-entry Python lists (see, e.g., ``Record.ALT``). Semicolon-delimited lists -of key=value pairs are converted to Python dictionaries, with flags being given -a ``True`` value. 
Integers and floats are handled exactly as you'd expect:: - - >>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r')) - >>> record = vcf_reader.next() - >>> print record.POS - 14370 - >>> print record.ALT - [A] - >>> print record.INFO['AF'] - [0.5] - -There are a number of convienience methods and properties for each ``Record`` allowing you to -examine properties of interest:: - - >>> print record.num_called, record.call_rate, record.num_unknown - 3 1.0 0 - >>> print record.num_hom_ref, record.num_het, record.num_hom_alt - 1 1 1 - >>> print record.nucl_diversity, record.aaf, record.heterozygosity - 0.6 [0.5] 0.5 - >>> print record.get_hets() - [Call(sample=NA00002, CallData(GT=1|0, GQ=48, DP=8, HQ=[51, 51]))] - >>> print record.is_snp, record.is_indel, record.is_transition, record.is_deletion - True False True False - >>> print record.var_type, record.var_subtype - snp ts - >>> print record.is_monomorphic - False - -``record.FORMAT`` will be a string specifying the format of the genotype -fields. In case the FORMAT column does not exist, ``record.FORMAT`` is -``None``. Finally, ``record.samples`` is a list of dictionaries containing the -parsed sample column and ``record.genotype`` is a way of looking up genotypes -by sample name:: - - >>> record = vcf_reader.next() - >>> for sample in record.samples: - ... print sample['GT'] - 0|0 - 0|1 - 0/0 - >>> print record.genotype('NA00001')['GT'] - 0|0 - -The genotypes are represented by ``Call`` objects, which have three attributes: the -corresponding Record ``site``, the sample name in ``sample`` and a dictionary of -call data in ``data``:: - - >>> call = record.genotype('NA00001') - >>> print call.site - Record(CHROM=20, POS=17330, REF=T, ALT=[A]) - >>> print call.sample - NA00001 - >>> print call.data - CallData(GT=0|0, GQ=49, DP=3, HQ=[58, 50]) - -Please note that as of release 0.4.0, attributes known to have single values (such as -``DP`` and ``GQ`` above) are returned as values. 
Other attributes are returned -as lists (such as ``HQ`` above). - -There are also a number of methods:: - - >>> print call.called, call.gt_type, call.gt_bases, call.phased - True 0 T|T True - -Metadata regarding the VCF file itself can be investigated through the -following attributes: - - * ``Reader.metadata`` - * ``Reader.infos`` - * ``Reader.filters`` - * ``Reader.formats`` - * ``Reader.samples`` - -For example:: - - >>> vcf_reader.metadata['fileDate'] - '20090805' - >>> vcf_reader.samples - ['NA00001', 'NA00002', 'NA00003'] - >>> vcf_reader.filters - OrderedDict([('q10', Filter(id='q10', desc='Quality below 10')), ('s50', Filter(id='s50', desc='Less than 50% of samples have data'))]) - >>> vcf_reader.infos['AA'].desc - 'Ancestral Allele' - -ALT records are actually classes, so that you can interrogate them:: - - >>> reader = vcf.Reader(open('vcf/test/example-4.1-bnd.vcf')) - >>> _ = reader.next(); row = reader.next() - >>> print row - Record(CHROM=1, POS=2, REF=T, ALT=[T[2:3[]) - >>> bnd = row.ALT[0] - >>> print bnd.withinMainAssembly, bnd.orientation, bnd.remoteOrientation, bnd.connectingSequence - True False True T - -Random access is supported for files with tabix indexes. Simply call fetch for the -region you are interested in:: - - >>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz') - >>> for record in vcf_reader.fetch('20', 1110696, 1230237): # doctest: +SKIP - ... print record - Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T]) - Record(CHROM=20, POS=1230237, REF=T, ALT=[None]) - -Or extract a single row:: - - >>> print vcf_reader.fetch('20', 1110696) # doctest: +SKIP - Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T]) - - -The ``Writer`` class provides a way of writing a VCF file. Currently, you must specify a -template ``Reader`` which provides the metadata:: - - >>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz') - >>> vcf_writer = vcf.Writer(open('/dev/null', 'w'), vcf_reader) - >>> for record in vcf_reader: - ... 
vcf_writer.write_record(record) - - -An extensible script is available to filter vcf files in vcf_filter.py. VCF filters -declared by other packages will be available for use in this script. Please -see :doc:`FILTERS` for full description. - -''' from vcf.parser import Reader, Writer from vcf.parser import VCFReader, VCFWriter from vcf.filters import Base as Filter from vcf.parser import RESERVED_INFO, RESERVED_FORMAT +from vcf.sample_filter import SampleFilter -VERSION = '0.6.7' +VERSION = '0.6.8' diff --git a/vcf/cparse.pyx b/vcf/cparse.pyx index 682e6a7..8a71d64 100644 --- a/vcf/cparse.pyx +++ b/vcf/cparse.pyx @@ -36,7 +36,10 @@ def parse_samples( vals = sampvals[j] # short circuit the most common - if vals == '.' or vals == './.': + if samp_fmt._fields[j] == 'GT': + sampdat[j] = vals + continue + elif not vals or vals == '.': sampdat[j] = None continue diff --git a/vcf/model.py b/vcf/model.py index c6e8f42..ef1edb7 100644 --- a/vcf/model.py +++ b/vcf/model.py @@ -1,33 +1,39 @@ from abc import ABCMeta, abstractmethod import collections import sys +import re try: from collections import Counter except ImportError: from counter import Counter +allele_delimiter = re.compile(r'''[|/]''') # to split a genotype into alleles class _Call(object): """ A genotype call, a cell entry in a VCF file""" - __slots__ = ['site', 'sample', 'data', 'gt_nums', 'called'] + __slots__ = ['site', 'sample', 'data', 'gt_nums', 'gt_alleles', 'called', 'ploidity'] def __init__(self, site, sample, data): #: The ``_Record`` for this ``_Call`` self.site = site #: The sample name self.sample = sample - #: Dictionary of data from the VCF file + #: Namedtuple of data from the VCF file self.data = data - try: - self.gt_nums = self.data.GT - #: True if the GT is not ./. - self.called = self.gt_nums is not None - except AttributeError: - self.gt_nums = None + + if hasattr(self.data, 'GT'): + self.gt_alleles = [(al if al != '.' 
else None) for al in allele_delimiter.split(self.data.GT)] + self.ploidity = len(self.gt_alleles) + self.called = all([al != None for al in self.gt_alleles]) + self.gt_nums = self.data.GT if self.called else None + else: #62 a call without a genotype is not defined as called or not + self.gt_alleles = None + self.ploidity = None self.called = None + self.gt_nums = None def __repr__(self): return "Call(sample=%s, %s)" % (self.sample, str(self.data)) @@ -51,12 +57,6 @@ class _Call(object): return "/" if not self.phased else "|" @property - def gt_alleles(self): - '''The numbers of the alleles called at a given sample''' - # grab the numeric alleles of the gt string; tokenize by phasing - return self.gt_nums.split(self.gt_phase_char()) - - @property def gt_bases(self): '''The actual genotype alleles. E.g. if VCF genotype is 0/1, return A/G @@ -125,10 +125,45 @@ class _Record(object): INFO and FORMAT are available as properties. The list of genotype calls is in the ``samples`` property. + + Regarding the coordinates associated with each instance: + + - ``POS``, per VCF specification, is the one-based index + (the first base of the contig has an index of 1) of the first + base of the ``REF`` sequence. + - The ``start`` and ``end`` denote the coordinates of the entire + ``REF`` sequence in the zero-based, half-open coordinate + system (see + http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms), + where the first base of the contig has an index of 0, and the + interval runs up to, but does not include, the base at the + ``end`` index. This indexing scheme is analagous to Python + slice notation. + - The ``affected_start`` and ``affected_end`` coordinates are + also in the zero-based, half-open coordinate system. These + coordinates indicate the precise region of the reference + genome actually affected by the events denoted in ``ALT`` + (i.e., the minimum ``affected_start`` and maximum + ``affected_end``). 
+ + - For SNPs and structural variants, the affected region + includes all bases of ``REF``, including the first base + (i.e., ``affected_start = start = POS - 1``). + - For deletions, the region includes all bases of ``REF`` + except the first base, which flanks upstream the actual + deletion event, per VCF specification. + - For insertions, the ``affected_start`` and ``affected_end`` + coordinates represent a 0 bp-length region between the two + flanking bases (i.e., ``affected_start`` = + ``affected_end``). This is analagous to Python slice + notation (see http://stackoverflow.com/a/2947881/38140). + Neither the upstream nor downstream flanking bases are + included in the region. """ def __init__(self, CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, sample_indexes, samples=None): self.CHROM = CHROM + #: the one-based coordinate of the first nucleotide in ``REF`` self.POS = POS self.ID = ID self.REF = REF @@ -137,9 +172,9 @@ class _Record(object): self.FILTER = FILTER self.INFO = INFO self.FORMAT = FORMAT - #: 0-based start coordinate + #: zero-based, half-open start coordinate of ``REF`` self.start = self.POS - 1 - #: 1-based end coordinate + #: zero-based, half-open end coordinate of ``REF`` self.end = self.start + len(self.REF) #: list of alleles. [0] = REF, [1:] = ALTS self.alleles = [self.REF] @@ -148,6 +183,61 @@ class _Record(object): self.samples = samples or [] self._sample_indexes = sample_indexes + # Setting affected_start and affected_end here for Sphinx + # autodoc purposes... 
+ #: zero-based, half-open start coordinate of affected region of reference genome + self.affected_start = None + #: zero-based, half-open end coordinate of affected region of reference genome (not included in the region) + self.affected_end = None + self._set_start_and_end() + + + def _set_start_and_end(self): + self.affected_start = self.affected_end = self.POS + for alt in self.ALT: + if alt is None: + start, end = self._compute_coordinates_for_none_alt() + elif alt.type == 'SNV': + start, end = self._compute_coordinates_for_snp() + elif alt.type == 'MNV': + start, end = self._compute_coordinates_for_indel() + else: + start, end = self._compute_coordinates_for_sv() + self.affected_start = min(self.affected_start, start) + self.affected_end = max(self.affected_end, end) + + + def _compute_coordinates_for_none_alt(self): + start = self.POS - 1 + end = start + len(self.REF) + return (start, end) + + + def _compute_coordinates_for_snp(self): + if len(self.REF) > 1: + start = self.POS + end = start + (len(self.REF) - 1) + else: + start = self.POS - 1 + end = self.POS + return (start, end) + + + def _compute_coordinates_for_indel(self): + if len(self.REF) > 1: + start = self.POS + end = start + (len(self.REF) - 1) + else: + start = end = self.POS + return (start, end) + + + def _compute_coordinates_for_sv(self): + start = self.POS - 1 + end = start + len(self.REF) + return (start, end) + + # For Python 2 def __cmp__(self, other): return cmp((self.CHROM, self.POS), (getattr(other, "CHROM", None), getattr(other, "POS", None))) @@ -221,12 +311,13 @@ class _Record(object): """ A list of allele frequencies of alternate alleles. NOTE: Denominator calc'ed from _called_ genotypes. 
""" - num_chroms = 2.0 * self.num_called + num_chroms = 0.0 allele_counts = Counter() for s in self.samples: if s.gt_type is not None: - allele_counts.update([s.gt_alleles[0]]) - allele_counts.update([s.gt_alleles[1]]) + for a in s.gt_alleles: + allele_counts.update([a]) + num_chroms += 1 return [allele_counts[str(i)]/num_chroms for i in range(1, len(self.ALT)+1)] @property @@ -239,7 +330,7 @@ class _Record(object): Derived from: \"Population Genetics: A Concise Guide, 2nd ed., p.45\" - John Gillespie. + John Gillespie. """ # skip if more than one alternate allele. assumes bi-allelic if len(self.ALT) > 1: @@ -285,7 +376,7 @@ class _Record(object): for alt in self.ALT: if alt is None or alt.type != "SNV": return False - if alt not in ['A', 'C', 'G', 'T']: + if alt not in ['A', 'C', 'G', 'T', 'N', '*']: return False return True @@ -376,13 +467,14 @@ class _Record(object): def var_subtype(self): """ Return the subtype of variant. + - For SNPs and INDELs, yeild one of: [ts, tv, ins, del] - - For SVs yield either "complex" or the SV type defined - in the ALT fields (removing the brackets). - E.g.: - <DEL> -> DEL - <INS:ME:L1> -> INS:ME:L1 - <DUP> -> DUP + - For SVs yield either "complex" or the SV type defined in the ALT + fields (removing the brackets). 
E.g.:: + + <DEL> -> DEL + <INS:ME:L1> -> INS:ME:L1 + <DUP> -> DUP The logic is meant to follow the rules outlined in the following paragraph at: diff --git a/vcf/parser.py b/vcf/parser.py index aac92f8..2cd8deb 100644 --- a/vcf/parser.py +++ b/vcf/parser.py @@ -1,10 +1,11 @@ +import codecs import collections -import re import csv import gzip -import sys import itertools -import codecs +import os +import re +import sys try: from collections import OrderedDict @@ -62,12 +63,13 @@ SINGULAR_METADATA = ['fileformat', 'fileDate', 'reference'] # Conversion between value in file and Python value field_counts = { '.': None, # Unknown number of values - 'A': -1, # Equal to the number of alleles in a given record + 'A': -1, # Equal to the number of alternate alleles in a given record 'G': -2, # Equal to the number of genotypes in a given record + 'R': -3, # Equal to the number of alleles including reference in a given record } -_Info = collections.namedtuple('Info', ['id', 'num', 'type', 'desc']) +_Info = collections.namedtuple('Info', ['id', 'num', 'type', 'desc', 'source', 'version']) _Filter = collections.namedtuple('Filter', ['id', 'desc']) _Alt = collections.namedtuple('Alt', ['id', 'desc']) _Format = collections.namedtuple('Format', ['id', 'num', 'type', 'desc']) @@ -80,36 +82,39 @@ class _vcf_metadata_parser(object): def __init__(self): super(_vcf_metadata_parser, self).__init__() self.info_pattern = re.compile(r'''\#\#INFO=< - ID=(?P<id>[^,]+), - Number=(?P<number>-?\d+|\.|[AG]), - Type=(?P<type>Integer|Float|Flag|Character|String), + ID=(?P<id>[^,]+),\s* + Number=(?P<number>-?\d+|\.|[AGR]),\s* + Type=(?P<type>Integer|Float|Flag|Character|String),\s* Description="(?P<desc>[^"]*)" + (?:,\s*Source="(?P<source>[^"]*)")? + (?:,\s*Version="?(?P<version>[^"]*)"?)? 
>''', re.VERBOSE) self.filter_pattern = re.compile(r'''\#\#FILTER=< - ID=(?P<id>[^,]+), + ID=(?P<id>[^,]+),\s* Description="(?P<desc>[^"]*)" >''', re.VERBOSE) self.alt_pattern = re.compile(r'''\#\#ALT=< - ID=(?P<id>[^,]+), + ID=(?P<id>[^,]+),\s* Description="(?P<desc>[^"]*)" >''', re.VERBOSE) self.format_pattern = re.compile(r'''\#\#FORMAT=< - ID=(?P<id>.+), - Number=(?P<number>-?\d+|\.|[AG]), - Type=(?P<type>.+), + ID=(?P<id>.+),\s* + Number=(?P<number>-?\d+|\.|[AGR]),\s* + Type=(?P<type>.+),\s* Description="(?P<desc>.*)" >''', re.VERBOSE) self.contig_pattern = re.compile(r'''\#\#contig=< - ID=(?P<id>[^,]+), - .* - length=(?P<length>-?\d+) + ID=(?P<id>[^>,]+) + (,.*length=(?P<length>-?\d+))? .* >''', re.VERBOSE) self.meta_pattern = re.compile(r'''##(?P<key>.+?)=(?P<val>.+)''') def vcf_field_count(self, num_str): """Cast vcf header numbers to integer or None""" - if num_str not in field_counts: + if num_str is None: + return None + elif num_str not in field_counts: # Fixed, specified number return int(num_str) else: @@ -125,7 +130,8 @@ class _vcf_metadata_parser(object): num = self.vcf_field_count(match.group('number')) info = _Info(match.group('id'), num, - match.group('type'), match.group('desc')) + match.group('type'), match.group('desc'), + match.group('source'), match.group('version')) return (match.group('id'), info) @@ -164,31 +170,28 @@ class _vcf_metadata_parser(object): match.group('type'), match.group('desc')) return (match.group('id'), form) - + def read_contig(self, contig_string): '''Read a meta-contig information line.''' match = self.contig_pattern.match(contig_string) if not match: raise SyntaxError( "One of the contig lines is malformed: %s" % contig_string) length = self.vcf_field_count(match.group('length')) contig = _Contig(match.group('id'), length) return (match.group('id'), contig) - def read_meta_hash(self, meta_string): - items = re.split("[<>]", meta_string) - # Removing initial hash marks and final equal sign - key = items[0][2:-1]
+ # assert re.match("##.+=<", meta_string) + items = meta_string.split('=', 1) + # Removing initial hash marks + key = items[0].lstrip('#') # N.B., items can have quoted values, so cannot just split on comma val = OrderedDict() state = 0 k = '' v = '' - for c in items[1]: + for c in items[1].strip('[<>]'): if state == 0: # reading item key if c == '=': @@ -219,16 +222,20 @@ class _vcf_metadata_parser(object): def read_meta(self, meta_string): if re.match("##.+=<", meta_string): return self.read_meta_hash(meta_string) - else: - match = self.meta_pattern.match(meta_string) - return match.group('key'), match.group('val') + match = self.meta_pattern.match(meta_string) + if not match: + # Spec only allows key=value, but we try to be liberal and + # interpret anything else as key=none (and all values are parsed + # as strings). + return meta_string.lstrip('#'), 'none' + return match.group('key'), match.group('val') class Reader(object): """ Reader for a VCF v 4.0 file, an iterator returning ``_Record objects`` """ - def __init__(self, fsock=None, filename=None, compressed=False, prepend_chr=False, - strict_whitespace=False): + def __init__(self, fsock=None, filename=None, compressed=None, prepend_chr=False, + strict_whitespace=False, encoding='ascii'): """ Create a new Reader for a VCF file. You must specify either fsock (stream) or filename. 
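The constructor signature above makes `compressed` default to `None` (gzip is then auto-detected from a `.gz` suffix) and adds an `encoding` parameter. A minimal sketch of the auto-detection rule; `resolve_compressed` is a hypothetical stand-alone helper, not PyVCF's actual code:

```python
def resolve_compressed(filename, compressed=None):
    # Mirrors the new Reader behaviour: gzip is auto-detected from the
    # filename only when the caller left compressed at its None default;
    # an explicit True/False always wins.
    if compressed is None:
        return filename.endswith('.gz')
    return compressed

print(resolve_compressed('calls.vcf.gz'))         # True (auto-detected)
print(resolve_compressed('calls.vcf'))            # False
print(resolve_compressed('calls.vcf.gz', False))  # False (explicit wins)
```

With the old `compressed = compressed or filename.endswith('.gz')` there was no way to force uncompressed reading of a `.gz`-named file; the `None` sentinel restores that option.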
Gzipped streams @@ -250,21 +257,26 @@ class Reader(object): self._reader = fsock if filename is None and hasattr(fsock, 'name'): filename = fsock.name - compressed = compressed or filename.endswith('.gz') + if compressed is None: + compressed = filename.endswith('.gz') elif filename: - compressed = compressed or filename.endswith('.gz') + if compressed is None: + compressed = filename.endswith('.gz') self._reader = open(filename, 'rb' if compressed else 'rt') self.filename = filename if compressed: self._reader = gzip.GzipFile(fileobj=self._reader) if sys.version > '3': - self._reader = codecs.getreader('ascii')(self._reader) + self._reader = codecs.getreader(encoding)(self._reader) if strict_whitespace: self._separator = '\t' else: self._separator = '\t| +' + self._row_pattern = re.compile(self._separator) + self._alt_pattern = re.compile('[\[\]]') + self.reader = (line.strip() for line in self._reader if line.strip()) #: metadata fields from header (string or hash, depending) @@ -287,6 +299,7 @@ class Reader(object): self._prepend_chr = prepend_chr self._parse_metainfo() self._format_cache = {} + self.encoding = encoding def __iter__(self): return self @@ -301,7 +314,7 @@ class Reader(object): parser = _vcf_metadata_parser() - line = self.reader.next() + line = next(self.reader) while line.startswith('##'): self._header_lines.append(line) @@ -320,7 +333,7 @@ class Reader(object): elif line.startswith('##FORMAT'): key, val = parser.read_format(line) self.formats[key] = val - + elif line.startswith('##contig'): key, val = parser.read_contig(line) self.contigs[key] = val @@ -334,9 +347,9 @@ class Reader(object): self.metadata[key] = [] self.metadata[key].append(val) - line = self.reader.next() + line = next(self.reader) - fields = re.split(self._separator, line[1:]) + fields = self._row_pattern.split(line[1:]) self._column_headers = fields[:9] self.samples = fields[9:] self._sample_indexes = dict([(x,i) for (i,x) in enumerate(self.samples)]) @@ -358,7 +371,7 @@ 
class Reader(object): retdict = {} for entry in entries: - entry = entry.split('=') + entry = entry.split('=', 1) ID = entry[0] try: entry_type = self.infos[ID].type @@ -389,6 +402,7 @@ class Reader(object): vals = entry[1].split(',') # commas are reserved characters indicating multiple values val = self._map(str, vals) except IndexError: + entry_type = 'Flag' val = True try: @@ -430,7 +444,6 @@ class Reader(object): # check whether we already know how to parse this format if samp_fmt not in self._format_cache: self._format_cache[samp_fmt] = self._parse_sample_format(samp_fmt) - samp_fmt = self._format_cache[samp_fmt] if cparse: @@ -450,7 +463,10 @@ class Reader(object): for i, vals in enumerate(sample.split(':')): # short circuit the most common - if vals == '.' or vals == './.': + if samp_fmt._fields[i] == 'GT': + sampdat[i] = vals + continue + elif not vals or vals == ".": sampdat[i] = None continue @@ -494,9 +510,9 @@ class Reader(object): return samp_data def _parse_alt(self, str): - if re.search('[\[\]]', str) is not None: + if self._alt_pattern.search(str) is not None: # Paired breakend - items = re.split('[\[\]]', str) + items = self._alt_pattern.split(str) remoteCoords = items[1].split(':') chr = remoteCoords[0] if chr[0] == '<': @@ -523,8 +539,8 @@ class Reader(object): def next(self): '''Return the next record in the file.''' - line = self.reader.next() - row = re.split(self._separator, line.rstrip()) + line = next(self.reader) + row = self._row_pattern.split(line.rstrip()) chrom = row[0] if self._prepend_chr: chrom = 'chr' + chrom @@ -559,6 +575,9 @@ class Reader(object): fmt = row[8] except IndexError: fmt = None + else: + if fmt == '.': + fmt = None record = _Record(chrom, pos, ID, ref, alt, qual, filt, info, fmt, self._sample_indexes) @@ -569,45 +588,60 @@ class Reader(object): return record - def fetch(self, chrom, start, end=None): - """ fetch records from a Tabix indexed VCF, requires pysam - if start and end are specified, return iterator over 
positions - if end not specified, return individual ``_Call`` at start or None + def fetch(self, chrom, start=None, end=None): + """ Fetches records from a tabix-indexed VCF file and returns an + iterable of ``_Record`` instances + + chrom must be specified. + + The start and end coordinates are in the zero-based, + half-open coordinate system, similar to ``_Record.start`` and + ``_Record.end``. The very first base of a chromosome is + index 0, and the region includes bases up to, but not + including the base at the end coordinate. For example + ``fetch('4', 10, 20)`` would include all variants + overlapping a 10 base pair region from the 11th base + through the 20th base (which is at index 19) of chromosome + 4. It would not include the 21st base (at index 20). See + http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms + for more information on the zero-based, half-open coordinate + system. + + If end is omitted, all variants from start until the end of + the chromosome chrom will be included. + + If start and end are omitted, all variants on chrom will be + returned. + + requires pysam + + """ if not pysam: raise Exception('pysam not available, try "pip install pysam"?') - if not self.filename: raise Exception('Please provide a filename (or a "normal" fsock)') if not self._tabix: - self._tabix = pysam.Tabixfile(self.filename) + self._tabix = pysam.Tabixfile(self.filename, + encoding=self.encoding) if self._prepend_chr and chrom[:3] == 'chr': chrom = chrom[3:] - # not sure why tabix needs position -1 - start = start - 1 - - if end is None: - self.reader = self._tabix.fetch(chrom, start, start + 1) - try: - return self.next() - except StopIteration: - return None - self.reader = self._tabix.fetch(chrom, start, end) return self class Writer(object): - """ VCF Writer """ + """VCF Writer.
On Windows Python 2, open stream with 'wb'.""" # Reverse keys and values in header field count dictionary counts = dict((v,k) for k,v in field_counts.iteritems()) def __init__(self, stream, template, lineterminator="\n"): - self.writer = csv.writer(stream, delimiter="\t", lineterminator=lineterminator) + self.writer = csv.writer(stream, delimiter="\t", + lineterminator=lineterminator, + quotechar='', quoting=csv.QUOTE_NONE) self.template = template self.stream = stream @@ -639,7 +673,10 @@ class Writer(object): for line in template.alts.itervalues(): stream.write(two.format(key="ALT", *line)) for line in template.contigs.itervalues(): - stream.write('##contig=<ID={0},length={1}>\n'.format(*line)) + if line.length: + stream.write('##contig=<ID={0},length={1}>\n'.format(*line)) + else: + stream.write('##contig=<ID={0}>\n'.format(*line)) self._write_header() @@ -699,28 +736,14 @@ class Writer(object): sorted(info, key=order_key)) def _format_sample(self, fmt, sample): - try: - # Try to get the GT value first. - gt = getattr(sample.data, 'GT') - # PyVCF stores './.' GT values as None, so we need to revert it back - # to './.' when writing. - if gt is None: - gt = './.' - except AttributeError: - # Failing that, try to check whether 'GT' is specified in the FORMAT - # field. If yes, use the recommended empty value ('./.') - if 'GT' in fmt: - gt = './.' - # Otherwise use an empty string as the value - else: - gt = '' - # If gt is an empty string (i.e. not stored), write all other data + if hasattr(sample.data, 'GT'): + gt = sample.data.GT + else: + gt = './.' if 'GT' in fmt else '' + if not gt: return ':'.join([self._stringify(x) for x in sample.data]) - # Otherwise use the GT values from above and combine it with the rest of - # the data. - # Note that this follows the VCF spec, where GT is always the first - # item whenever it is present. + # Following the VCF spec, GT is always the first item whenever it is present. 
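The simplified `_format_sample` logic above can be mirrored by a small stand-alone sketch; `format_sample_fields` and its plain-dict input are hypothetical illustrations, not PyVCF's API:

```python
def format_sample_fields(fmt_keys, data):
    # Sketch of the intent: when GT is declared in FORMAT but absent
    # from the data, emit the recommended empty value './.'; and per
    # the VCF spec, GT always comes first whenever it is present.
    if 'GT' in data:
        gt = data['GT']
    else:
        gt = './.' if 'GT' in fmt_keys else ''
    rest = [str(data[k]) for k in fmt_keys if k != 'GT']
    if not gt:
        return ':'.join(rest)
    return ':'.join([gt] + rest)

print(format_sample_fields(['GT', 'DP'], {'GT': '0/1', 'DP': 10}))  # 0/1:10
print(format_sample_fields(['DP', 'GQ'], {'DP': 10, 'GQ': 99}))     # 10:99
```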
else: return ':'.join([gt] + [self._stringify(x) for x in sample.data[1:]]) diff --git a/vcf/sample_filter.py b/vcf/sample_filter.py new file mode 100644 index 0000000..b156b45 --- /dev/null +++ b/vcf/sample_filter.py @@ -0,0 +1,115 @@ +# Author: Lenna X. Peterson +# github.com/lennax +# arklenna at gmail dot com + +import logging +import sys +import warnings + + +from parser import Reader, Writer + + +class SampleFilter(object): + """ + Modifies the vcf Reader to filter each row by sample as it is parsed. + + """ + + def __init__(self, infile, outfile=None, filters=None, invert=False): + # Methods to add to Reader + def get_filter(self): + return self._samp_filter + + def set_filter(self, filt): + self._samp_filter = filt + if filt: + self.samples = [val for idx, val in enumerate(self.samples) + if idx not in set(filt)] + + def filter_samples(fn): + """Decorator function to filter sample parameter""" + def filt(self, samples, *args): + samples = [val for idx, val in enumerate(samples) + if idx not in set(self.sample_filter)] + return fn(self, samples, *args) + return filt + + # Add property to Reader for filter list + Reader.sample_filter = property(get_filter, set_filter) + Reader._samp_filter = [] + # Modify Reader._parse_samples to filter samples + self._orig_parse_samples = Reader._parse_samples + Reader._parse_samples = filter_samples(Reader._parse_samples) + self.parser = Reader(filename=infile) + # Store initial samples and indices + self.samples = self.parser.samples + self.smp_idx = dict([(v, k) for k, v in enumerate(self.samples)]) + # Properties for filter/writer + self.outfile = outfile + self.invert = invert + self.filters = filters + if filters is not None: + self.set_filters() + self.write() + + def __del__(self): + try: + self._undo_monkey_patch() + except AttributeError: + pass + + def set_filters(self, filters=None, invert=False): + """Convert filters from string to list of indices, set on Reader""" + if filters is not None: + self.filters = 
filters + if invert: + self.invert = invert + filt_l = self.filters.split(",") + filt_s = set(filt_l) + if len(filt_s) < len(filt_l): + warnings.warn("Non-unique filters, ignoring", RuntimeWarning) + + def filt2idx(item): + """Convert filter to valid sample index""" + try: + item = int(item) + except ValueError: + # not an idx, check if it's a value + return self.smp_idx.get(item) + else: + # is int, check if it's an idx + if item < len(self.samples): + return item + filters = set(filter(lambda x: x is not None, map(filt2idx, filt_s))) + if len(filters) < len(filt_s): + # TODO print the filters that were ignored + warnings.warn("Invalid filters, ignoring", RuntimeWarning) + + if self.invert: + filters = set(xrange(len(self.samples))).difference(filters) + + # `sample_filter` setter updates `samples` + self.parser.sample_filter = filters + if len(self.parser.samples) == 0: + warnings.warn("Number of samples to keep is zero", RuntimeWarning) + logging.info("Keeping these samples: {0}\n".format(self.parser.samples)) + return self.parser.samples + + def write(self, outfile=None): + if outfile is not None: + self.outfile = outfile + if self.outfile is None: + _out = sys.stdout + elif hasattr(self.outfile, 'write'): + _out = self.outfile + else: + _out = open(self.outfile, "wb") + logging.info("Writing to '{0}'\n".format(self.outfile)) + writer = Writer(_out, self.parser) + for row in self.parser: + writer.write_record(row) + + def _undo_monkey_patch(self): + Reader._parse_samples = self._orig_parse_samples + delattr(Reader, 'sample_filter') diff --git a/vcf/test/FT.vcf b/vcf/test/FT.vcf new file mode 100644 index 0000000..e42436a --- /dev/null +++ b/vcf/test/FT.vcf @@ -0,0 +1,50 @@ +##fileformat=VCFv4.2 +##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location"> +##FILTER=<ID=DP125,Description="DP<125"> +##FILTER=<ID=DP130,Description="DP<130"> +##FILTER=<ID=LowQual,Description="Low quality"> 
+##FILTER=<ID=Q4800,Description="QUAL<4800.0"> +##FILTER=<ID=Q5000,Description="QUAL<5000.0"> +##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed"> +##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> +##FORMAT=<ID=FT,Number=.,Type=String,Description="Genotype-level filter"> +##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> +##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> +##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> +##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification"> +##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)"> +##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias."> +##GATKCommandLine.VariantFiltration=<ID=VariantFiltration,Version=3.6-0-g89b7209,Date="Wed Jul 27 09:50:44 CEST 2016",Epoch=1469605844963,CommandLineOptions="analysis_type=VariantFiltration input_file=[] showFullBamList=false read_buffer_size=null read_filter=[] disable_read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=../ref.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 max [...] 
+##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed"> +##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed"> +##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes"> +##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities"> +##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases"> +##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered"> +##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?"> +##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval"> +##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity"> +##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias"> +##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes"> +##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation"> +##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed"> +##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed"> +##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality"> +##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank 
sum test of Alt vs. Ref read mapping qualities"> +##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth"> +##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality"> +##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias"> +##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias"> +##contig=<ID=ref,length=4888768> +##reference=file://../ref.fasta +#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1 2 3 4 5 +ref 63393 . C A 29454.60 . AC=5;AF=1.00;AN=5;DP=719;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;QD=29.67;SOR=0.965 GT:AD:DP:GQ:PL 1:0,166:166:99:6740,0 1:0,142:142:99:5824,0 1:0,134:134:99:5616,0 1:0,122:122:99:4930,0 1:0,155:155:99:6371,0 +ref 65903 . AATTGCGCTG A 7340.57 PASS AC=1;AF=0.200;AN=5;DP=524;FS=0.000;MLEAC=1;MLEAF=0.200;MQ=60.00;QD=34.04;SOR=1.091 GT:AD:DP:FT:GQ:PL 1:0,164:164:PASS:99:7383,0 0:95,0:95:DP125;DP130:99:0,1800 0:88,0:88:DP125;DP130:99:0,1800 0:87,0:87:DP125;DP130:99:0,1800 0:89,0:89:DP125;DP130:99:0,1800 +ref 70837 . C A 4711.61 Q4800;Q5000 AC=1;AF=0.200;AN=5;DP=512;FS=0.000;MLEAC=1;MLEAF=0.200;MQ=60.00;QD=27.64;SOR=0.726 GT:AD:DP:FT:GQ:PL 0:121,0:121:DP125;DP130:99:0,1800 0:95,0:95:DP125;DP130:99:0,1800 1:0,120:120:DP125;DP130:99:4745,0 0:87,0:87:DP125;DP130:99:0,1800 0:89,0:89:DP125;DP130:99:0,1800 +ref 71448 . C T 31134.60 PASS AC=5;AF=1.00;AN=5;BaseQRankSum=2.22;ClippingRankSum=0.00;DP=768;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;MQRankSum=0.00;QD=29.43;ReadPosRankSum=2.03;SOR=0.295 GT:AD:DP:FT:GQ:PL 1:0,147:147:PASS:99:5996,0 1:1,183:184:PASS:99:7501,0 1:0,113:113:DP125;DP130:99:4559,0 1:0,161:161:PASS:99:6436,0 1:0,163:163:PASS:99:6669,0 +ref 104257 . 
C T 5521.61 PASS AC=1;AF=0.200;AN=5;DP=506;FS=0.000;MLEAC=1;MLEAF=0.200;MQ=60.00;QD=29.45;SOR=0.854 GT:AD:DP:FT:GQ:PL 0:101,0:101:DP125;DP130:99:0,1800 0:109,0:109:DP125;DP130:99:0,1800 1:0,132:132:PASS:99:5555,0 0:67,0:67:DP125;DP130:99:0,1800 0:97,0:97:DP125;DP130:99:0,1800 +ref 140658 . C A 32467.60 PASS AC=5;AF=1.00;AN=5;BaseQRankSum=2.24;ClippingRankSum=0.00;DP=801;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;MQRankSum=0.00;QD=29.65;ReadPosRankSum=1.27;SOR=0.346 GT:AD:DP:GQ:PL 1:0,170:170:99:6854,0 1:0,198:198:99:8098,0 1:0,136:136:99:5554,0 1:0,141:141:99:5661,0 1:1,155:156:99:6327,0 +ref 147463 . C A 4885.61 Q5000 AC=1;AF=0.200;AN=5;BaseQRankSum=-7.720e-01;ClippingRankSum=0.00;DP=503;FS=0.000;MLEAC=1;MLEAF=0.200;MQ=60.00;MQRankSum=0.00;QD=35.03;ReadPosRankSum=-6.950e-01;SOR=0.278 GT:AD:DP:FT:GQ:PL 0:97,0:97:DP125;DP130:99:0,1800 0:104,0:104:DP125;DP130:99:0,1800 0:84,0:84:DP125;DP130:99:0,1800 1:1,128:129:DP130:99:4919,0 0:89,0:89:DP125;DP130:99:0,1800 +ref 154578 . A G 32015.60 PASS AC=5;AF=1.00;AN=5;DP=776;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;QD=25.82;SOR=0.902 GT:AD:DP:GQ:PL 1:0,152:152:99:6300,0 1:0,183:183:99:7608,0 1:0,137:137:99:5713,0 1:0,148:148:99:6040,0 1:0,156:156:99:6381,0 +ref 203200 . C T 30880.60 PASS AC=5;AF=1.00;AN=5;DP=752;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;QD=29.65;SOR=0.878 GT:AD:DP:FT:GQ:PL 1:0,161:161:PASS:99:6708,0 1:0,185:185:PASS:99:7602,0 1:0,136:136:PASS:99:5602,0 1:0,126:126:DP130:99:5080,0 1:0,144:144:PASS:99:5915,0 +ref 231665 . 
C T 30074.60 PASS AC=5;AF=1.00;AN=5;DP=735;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;QD=33.23;SOR=0.938 GT:AD:DP:FT:GQ:PL 1:0,191:191:PASS:99:7867,0 1:0,159:159:PASS:99:6431,0 1:0,130:130:PASS:99:5299,0 1:0,129:129:DP130:99:5290,0 1:0,126:126:DP130:99:5214,0 diff --git a/vcf/test/issue-214.vcf b/vcf/test/issue-214.vcf new file mode 100644 index 0000000..dbc5fac --- /dev/null +++ b/vcf/test/issue-214.vcf @@ -0,0 +1,32 @@ +##fileformat=VCFv4.1 +##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location"> +##FILTER=<ID=LowQual,Description="Low quality"> +##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed"> +##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> +##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> +##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> +##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> +##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification"> +##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)"> +##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias."> +##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed"> +##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed"> +##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes"> +##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. 
Ref base qualities"> +##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases"> +##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered"> +##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?"> +##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval"> +##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias"> +##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes"> +##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation"> +##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed"> +##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed"> +##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality"> +##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities"> +##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth"> +##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias"> +##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias"> +#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 +1 456904 . T C,* 6162.77 . 
AC=1,1;AF=8.333e-03,8.333e-03;AN=120;DP=7693;FS=0.000;MLEAC=1,1;MLEAF=8.333e-03,8.333e-03;MQ=60.00;QD=31.36;SOR=0.976 GT:AD:DP:GQ:PL 0:106,0,0:106:99:0,1800,1800 0:110,0,0:110:99:0,1800,1800 +1 456940 . * C,T 6162.77 . AC=1,1;AF=8.333e-03,8.333e-03;AN=120;DP=7693;FS=0.000;MLEAC=1,1;MLEAF=8.333e-03,8.333e-03;MQ=60.00;QD=31.36;SOR=0.976 GT:AD:DP:GQ:PL 0:106,0,0:106:99:0,1800,1800 0:110,0,0:110:99:0,1800,1800 diff --git a/vcf/test/strelka.vcf b/vcf/test/strelka.vcf new file mode 100644 index 0000000..b5aea76 --- /dev/null +++ b/vcf/test/strelka.vcf @@ -0,0 +1,57 @@ +##fileformat=VCFv4.1 +##FILTER=<ID=BCNoise,Description="Average fraction of filtered basecalls within 50 bases of the indel exceeds 0.3"> +##FILTER=<ID=QSI_ref,Description="Normal sample is not homozygous ref or sindel Q-score < 30, ie calls with NT!=ref or QSI_NT < 30"> +##FILTER=<ID=QSS_ref,Description="Normal sample is not homozygous ref or ssnv Q-score < 15, ie calls with NT!=ref or QSS_NT < 15"> +##FILTER=<ID=Repeat,Description="Sequence repeat of more than 8x in the reference sequence"> +##FILTER=<ID=SpanDel,Description="Fraction of reads crossing site with spanning deletions in either sample exceeeds 0.75"> +##FILTER=<ID=iHpol,Description="Indel overlaps an interrupted homopolymer longer than 14x in the reference sequence"> +##FORMAT=<ID=AU,Number=2,Type=Integer,Description="Number of 'A' alleles used in tiers 1,2"> +##FORMAT=<ID=CU,Number=2,Type=Integer,Description="Number of 'C' alleles used in tiers 1,2"> +##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth for tier1"> +##FORMAT=<ID=DP2,Number=1,Type=Integer,Description="Read depth for tier2"> +##FORMAT=<ID=DP50,Number=1,Type=Float,Description="Average tier1 read depth within 50 bases"> +##FORMAT=<ID=FDP,Number=1,Type=Integer,Description="Number of basecalls filtered from original read depth for tier1"> +##FORMAT=<ID=FDP50,Number=1,Type=Float,Description="Average tier1 number of basecalls filtered from original read depth within 50 
bases"> +##FORMAT=<ID=GU,Number=2,Type=Integer,Description="Number of 'G' alleles used in tiers 1,2"> +##FORMAT=<ID=SDP,Number=1,Type=Integer,Description="Number of reads with deletions spanning this site at tier1"> +##FORMAT=<ID=SUBDP,Number=1,Type=Integer,Description="Number of reads below tier1 mapping quality threshold aligned across this site"> +##FORMAT=<ID=SUBDP50,Number=1,Type=Float,Description="Average number of reads below tier1 mapping quality threshold aligned across sites within 50 bases"> +##FORMAT=<ID=TAR,Number=2,Type=Integer,Description="Reads strongly supporting alternate allele for tiers 1,2"> +##FORMAT=<ID=TIR,Number=2,Type=Integer,Description="Reads strongly supporting indel allele for tiers 1,2"> +##FORMAT=<ID=TOR,Number=2,Type=Integer,Description="Other reads (weak support or insufficient indel breakpoint overlap) for tiers 1,2"> +##FORMAT=<ID=TU,Number=2,Type=Integer,Description="Number of 'T' alleles used in tiers 1,2"> +##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed"> +##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed"> +##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes"> +##INFO=<ID=IC,Number=1,Type=Integer,Description="Number of times RU repeats in the indel allele"> +##INFO=<ID=IHP,Number=1,Type=Integer,Description="Largest reference interrupted homopolymer length intersecting with the indel"> +##INFO=<ID=NT,Number=1,Type=String,Description="Genotype of the normal in all data tiers, as used to classify somatic variants. One of {ref,het,hom,conflict}."> +##INFO=<ID=OVERLAP,Number=0,Type=Flag,Description="Somatic indel possibly overlaps a second indel."> +##INFO=<ID=QSI,Number=1,Type=Integer,Description="Quality score for any somatic variant, ie. 
for the ALT haplotype to be present at a significantly different frequency in the tumor and normal"> +##INFO=<ID=QSI_NT,Number=1,Type=Integer,Description="Quality score reflecting the joint probability of a somatic variant and NT"> +##INFO=<ID=QSS,Number=1,Type=Integer,Description="Quality score for any somatic snv, ie. for the ALT allele to be present at a significantly different frequency in the tumor and normal"> +##INFO=<ID=QSS_NT,Number=1,Type=Integer,Description="Quality score reflecting the joint probability of a somatic variant and NT"> +##INFO=<ID=RC,Number=1,Type=Integer,Description="Number of times RU repeats in the reference allele"> +##INFO=<ID=RU,Number=1,Type=String,Description="Smallest repeating sequence unit in inserted or deleted sequence"> +##INFO=<ID=SGT,Number=1,Type=String,Description="Most likely somatic genotype excluding normal noise states"> +##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic mutation"> +##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant"> +##INFO=<ID=TQSI,Number=1,Type=Integer,Description="Data tier used to compute QSI"> +##INFO=<ID=TQSI_NT,Number=1,Type=Integer,Description="Data tier used to compute QSI_NT"> +##INFO=<ID=TQSS,Number=1,Type=Integer,Description="Data tier used to compute QSS"> +##INFO=<ID=TQSS_NT,Number=1,Type=Integer,Description="Data tier used to compute QSS_NT"> +##INFO=<ID=set,Number=1,Type=String,Description="Source VCF for the merged record in CombineVariants"> +##content=strelka somatic indel calls +##fileDate=20151113 +##germlineIndelTheta=0.0001 +##germlineSnvTheta=0.001 +##priorSomaticIndelRate=1e-06 +##priorSomaticSnvRate=1e-06 +##reference=file:///b37.fasta +##source=strelka +##source_version=2.0.17.strelka1 +##startTime=Fri Nov 13 19:38:43 2015 +#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL.variant NORMAL.variant2 TUMOR.variant TUMOR.variant2 +1 1666175 . C T . 
PASS AC=0;AF=0.00;AN=0;NT=ref;QSS=28;QSS_NT=28;SGT=CC->CT;SOMATIC;TQSS=1;TQSS_NT=1;set=variant AU:CU:DP:FDP:GU:SDP:SUBDP:TU 0,0:42,42:43:0:0,0:0:0:1,1 0,0:45,45:59:0:0,0:0:0:14,14 +1 3750492 . G A . PASS AC=0;AF=0.00;AN=0;NT=ref;QSS=38;QSS_NT=38;SGT=GG->AG;SOMATIC;TQSS=2;TQSS_NT=2;set=variant AU:CU:DP:FDP:GU:SDP:SUBDP:TU 0,0:0,0:116:0:116,116:0:0:0,0 6,6:0,0:96:0:90,91:0:0:0,0 +1 9117626 . G A . PASS AC=0;AF=0.00;AN=0;NT=ref;QSS=32;QSS_NT=32;SGT=GG->AG;SOMATIC;TQSS=1;TQSS_NT=1;set=variant AU:CU:DP:FDP:GU:SDP:SUBDP:TU 0,0:0,0:165:0:165,166:0:0:0,0 6,6:0,0:132:0:126,127:0:0:0,0 \ No newline at end of file diff --git a/vcf/test/test_vcf.py b/vcf/test/test_vcf.py index efa633e..20b71ad 100644 --- a/vcf/test/test_vcf.py +++ b/vcf/test/test_vcf.py @@ -1,13 +1,27 @@ from __future__ import print_function import unittest +try: + unittest.skip +except AttributeError: + import unittest2 as unittest import doctest import os import commands import cPickle from StringIO import StringIO +import subprocess +import sys + +try: + import pysam +except ImportError: + pysam = None import vcf -from vcf import utils +from vcf import model, utils + +IS_PYTHON2 = sys.version_info[0] == 2 +IS_NOT_PYPY = 'PyPy' not in sys.version suite = doctest.DocTestSuite(vcf) @@ -100,6 +114,37 @@ class TestVcfSpecs(unittest.TestCase): print(c) assert c + def test_vcf_4_2(self): + reader = vcf.Reader(fh('example-4.2.vcf')) + self.assertEqual(reader.metadata['fileformat'], 'VCFv4.2') + + # If INFO contains no Source and Version keys, they should be None. + self.assertEqual(reader.infos['DP'].source, None) + self.assertEqual(reader.infos['DP'].version, None) + + # According to spec, INFO Version key is required to be double quoted, + # but at least SAMtools 1.0 does not quote it. So we want to be + # forgiving here. 
+ self.assertEqual(reader.infos['VDB'].source, None) + self.assertEqual(reader.infos['VDB'].version, '3') + + # test we can walk the file at least + for r in reader: + for c in r: + assert c + + def test_contig_idonly(self): + """Test VCF inputs with ##contig inputs containing only IDs. produced by bcftools 1.2+ + """ + reader = vcf.Reader(fh("contig_idonly.vcf")) + for cid, contig in reader.contigs.items(): + if cid == "1": + assert contig.length is None + elif cid == "2": + assert contig.length == 2000 + elif cid == "3": + assert contig.length == 3000 + class TestGatkOutput(unittest.TestCase): filename = 'gatk.vcf' @@ -184,6 +229,34 @@ class TestBcfToolsOutput(unittest.TestCase): for s in r.samples: s.phased +class TestIssue214(unittest.TestCase): + """ See https://github.com/jamescasbon/PyVCF/issues/214 """ + + def test_issue_214_is_snp(self): + reader=vcf.Reader(fh('issue-214.vcf')) + r=next(reader) + self.assertTrue(r.is_snp) + + def test_issue_214_var_type(self): + reader=vcf.Reader(fh('issue-214.vcf')) + r=next(reader) + self.assertEqual(r.var_type,'snp') + + # Can the ref even be a spanning deletion? + # Note, this does not trigger issue 214, but I've added it here for completeness + def test_issue_214_ref_is_del_is_snp(self): + reader=vcf.Reader(fh('issue-214.vcf')) + next(reader) + r=next(reader) + self.assertTrue(r.is_snp) + + # Can the ref even be a spanning deletion? 
+ # Note, this does not trigger issue 214, but I've added it here for completeness + def test_issue_214_ref_is_del_var_type(self): + reader=vcf.Reader(fh('issue-214.vcf')) + next(reader) + r=next(reader) + self.assertEqual(r.var_type,'snp') class Test1kg(unittest.TestCase): @@ -248,6 +321,16 @@ class TestGoNL(unittest.TestCase): self.assertEqual(reader.contigs['1'].length, 249250621) +class TestStringAsFlag(unittest.TestCase): + + def test_string_as_flag(self): + """A flag INFO field is declared as string (not allowed by the spec, + but seen in practice).""" + reader = vcf.Reader(fh('string_as_flag.vcf', 'r')) + for _ in reader: + pass + + class TestInfoOrder(unittest.TestCase): def _assert_order(self, definitions, fields): @@ -310,6 +393,38 @@ class TestInfoTypeCharacter(unittest.TestCase): self.assertEquals(l.INFO, r.INFO) +class TestParseMetaLine(unittest.TestCase): + def test_parse(self): + reader = vcf.Reader(fh('parse-meta-line.vcf')) + f = reader.metadata['MYFIELD'][0] + self.assertEqual(f['ID'], 'SomeField') + self.assertEqual(f['Version'], '3.4-0-g7e26428') + self.assertEqual(f['Date'], '"Wed Oct 07 09:11:47 CEST 2015"') + self.assertEqual(f['Options'], '"< 4 and > 3"') + next(reader) + + def test_write(self): + reader = vcf.Reader(fh('parse-meta-line.vcf')) + out = StringIO() + writer = vcf.Writer(out, reader) + + records = list(reader) + + for record in records: + writer.write_record(record) + out.seek(0) + reader2 = vcf.Reader(out) + + f = reader2.metadata['MYFIELD'][0] + self.assertEqual(f['ID'], 'SomeField') + self.assertEqual(f['Version'], '3.4-0-g7e26428') + self.assertEqual(f['Date'], '"Wed Oct 07 09:11:47 CEST 2015"') + self.assertEqual(f['Options'], '"< 4 and > 3"') + + for l, r in zip(records, reader2): + self.assertEquals(l.INFO, r.INFO) + + class TestGatkOutputWriter(unittest.TestCase): def testWrite(self): @@ -402,6 +517,27 @@ class TestSamplesSpace(unittest.TestCase): self.assertEqual(self.reader.samples, self.samples) +class 
TestMetadataWhitespace(unittest.TestCase): + filename = 'metadata-whitespace.vcf' + def test_metadata_whitespace(self): + """ + Test parsing metadata header lines with whitespace. + """ + self.reader = vcf.Reader(fh(self.filename)) + + # Pick one INFO line and assert that we parsed it correctly. + info_indel = self.reader.infos['INDEL'] + assert info_indel.id == 'INDEL' + assert info_indel.num == 0 + assert info_indel.type == 'Flag' + assert info_indel.desc == 'Indicates that the variant is an INDEL.' + + # Test we can walk the file at least. + for r in self.reader: + for c in r: + pass + + class TestMixedFiltering(unittest.TestCase): filename = 'mixed-filtering.vcf' def test_mixed_filtering(self): @@ -426,7 +562,7 @@ class TestRecord(unittest.TestCase): self.assertEqual(len(var.samples), num_calls) def test_dunder_eq(self): - rec = vcf.Reader(fh('example-4.0.vcf')).next() + rec = next(vcf.Reader(fh('example-4.0.vcf'))) self.assertFalse(rec == None) self.assertFalse(None == rec) @@ -459,6 +595,13 @@ class TestRecord(unittest.TestCase): self.assertEqual([0.0/6.0], aaf) elif var.POS == 1234567: self.assertEqual([2.0/4.0, 1.0/4.0], aaf) + reader = vcf.Reader(fh('example-4.1-ploidy.vcf')) + for var in reader: + aaf = var.aaf + if var.POS == 60034: + self.assertEqual([4.0/6.0], aaf) + elif var.POS == 60387: + self.assertEqual([1.0/3.0], aaf) def test_pi(self): reader = vcf.Reader(fh('example-4.0.vcf')) @@ -510,6 +653,24 @@ class TestRecord(unittest.TestCase): elif var.POS == 1234567: self.assertEqual(False, is_snp) + + def test_is_snp_for_n_alt(self): + record = model._Record( + '1', + 10, + 'id1', + 'C', + [model._Substitution('N')], + None, + None, + {}, + None, + {}, + None + ) + self.assertTrue(record.is_snp) + + def test_is_indel(self): reader = vcf.Reader(fh('example-4.0.vcf')) for var in reader: @@ -731,7 +892,7 @@ class TestRecord(unittest.TestCase): def test_info_multiple_values(self): reader = vcf.Reader(fh('example-4.1-info-multiple-values.vcf')) - var = 
reader.next() + var = next(reader) # check Float type INFO field with multiple values expected = [19.3, 47.4, 14.0] actual = var.INFO['RepeatCopies'] @@ -751,11 +912,244 @@ class TestRecord(unittest.TestCase): self.assertEqual(cPickle.loads(cPickle.dumps(var)), var) + def assert_has_expected_coordinates( + self, + record, + expected_coordinates, + expected_affected_coordinates + ): + self.assertEqual( + (record.start, record.end), + expected_coordinates + ) + self.assertEqual( + (record.affected_start, record.affected_end), + expected_affected_coordinates + ) + + + def test_coordinates_for_snp(self): + record = model._Record( + '1', + 10, + 'id1', + 'C', + [model._Substitution('A')], + None, + None, + {}, + None, + {}, + None + ) + self.assert_has_expected_coordinates(record, (9, 10), (9, 10)) + + + def test_coordinates_for_insertion(self): + record = model._Record( + '1', + 10, + 'id2', + 'C', + [model._Substitution('CTA')], + None, + None, + {}, + None, + {}, + None + ) + self.assert_has_expected_coordinates(record, (9, 10), (10, 10)) + + + def test_coordinates_for_deletion(self): + record = model._Record( + '1', + 10, + 'id3', + 'CTA', + [model._Substitution('C')], + None, + None, + {}, + None, + {}, + None + ) + self.assert_has_expected_coordinates(record, (9, 12), (10, 12)) + + + def test_coordinates_for_None_alt(self): + record = model._Record( + '1', + 10, + 'id4', + 'C', + [None], + None, + None, + {}, + None, + {}, + None + ) + self.assert_has_expected_coordinates(record, (9, 10), (9, 10)) + + + def test_coordinates_for_multiple_snps(self): + record = model._Record( + '1', + 10, + 'id5', + 'C', + [ + model._Substitution('A'), + model._Substitution('G'), + model._Substitution('T') + ], + None, + None, + {}, + None, + {}, + None + ) + self.assert_has_expected_coordinates(record, (9, 10), (9, 10)) + + + def test_coordinates_for_insert_and_snp(self): + record = model._Record( + '1', + 10, + 'id6', + 'C', + [ + model._Substitution('GTA'), + 
model._Substitution('G'), + ], + None, + None, + {}, + None, + {}, + None + ) + self.assert_has_expected_coordinates(record, (9, 10), (9, 10)) + record = model._Record( + '1', + 10, + 'id7', + 'C', + [ + model._Substitution('G'), + model._Substitution('GTA'), + ], + None, + None, + {}, + None, + {}, + None + ) + self.assert_has_expected_coordinates(record, (9, 10), (9, 10)) + + + def test_coordinates_for_snp_and_deletion(self): + record = model._Record( + '1', + 10, + 'id8', + 'CTA', + [ + model._Substitution('C'), + model._Substitution('CTG'), + ], + None, + None, + {}, + None, + {}, + None + ) + self.assert_has_expected_coordinates(record, (9, 12), (10, 12)) + record = model._Record( + '1', + 10, + 'id9', + 'CTA', + [ + model._Substitution('CTG'), + model._Substitution('C'), + ], + None, + None, + {}, + None, + {}, + None + ) + self.assert_has_expected_coordinates(record, (9, 12), (10, 12)) + + + def test_coordinates_for_insertion_and_deletion(self): + record = model._Record( + '1', + 10, + 'id10', + 'CT', + [ + model._Substitution('CA'), + model._Substitution('CTT'), + ], + None, + None, + {}, + None, + {}, + None + ) + self.assert_has_expected_coordinates(record, (9, 11), (10, 11)) + record = model._Record( + '1', + 10, + 'id11', + 'CT', + [ + model._Substitution('CTT'), + model._Substitution('CA'), + ], + None, + None, + {}, + None, + {}, + None + ) + self.assert_has_expected_coordinates(record, (9, 11), (10, 11)) + + + def test_coordinates_for_breakend(self): + record = model._Record( + '1', + 10, + 'id12', + 'CTA', + [model._Breakend('1', 500, False, True, 'GGTC', True)], + None, + None, + {}, + None, + {}, + None + ) + self.assert_has_expected_coordinates(record, (9, 12), (9, 12)) + + class TestCall(unittest.TestCase): def test_dunder_eq(self): reader = vcf.Reader(fh('example-4.0.vcf')) - var = reader.next() + var = next(reader) example_call = var.samples[0] self.assertFalse(example_call == None) self.assertFalse(None == example_call) @@ -808,40 +1202,73 @@ 
class TestCall(unittest.TestCase): self.assertEqual([None,1,2], gt_types) -class TestTabix(unittest.TestCase): [email protected](pysam, "test requires installation of PySAM.") +class TestFetch(unittest.TestCase): def setUp(self): self.reader = vcf.Reader(fh('tb.vcf.gz', 'rb')) - self.run = vcf.parser.pysam is not None + + def assertFetchedExpectedPositions( + self, fetched_variants, expected_positions): + fetched_positions = [var.POS for var in fetched_variants] + self.assertEqual(fetched_positions, expected_positions) + + + def testNoVariantsInRange(self): + fetched_variants = self.reader.fetch('20', 14370, 17329) + self.assertFetchedExpectedPositions(fetched_variants, []) + + + def testNoVariantsForZeroLengthInterval(self): + fetched_variants = self.reader.fetch('20', 14369, 14369) + self.assertFetchedExpectedPositions(fetched_variants, []) def testFetchRange(self): - if not self.run: - return - lines = list(self.reader.fetch('20', 14370, 14370)) - self.assertEquals(len(lines), 1) - self.assertEqual(lines[0].POS, 14370) + fetched_variants = self.reader.fetch('20', 14369, 14370) + self.assertFetchedExpectedPositions(fetched_variants, [14370]) + + fetched_variants = self.reader.fetch('20', 14369, 17330) + self.assertFetchedExpectedPositions( + fetched_variants, [14370, 17330]) - lines = list(self.reader.fetch('20', 14370, 17330)) - self.assertEquals(len(lines), 2) - self.assertEqual(lines[0].POS, 14370) - self.assertEqual(lines[1].POS, 17330) + fetched_variants = self.reader.fetch('20', 1110695, 1234567) + self.assertFetchedExpectedPositions( + fetched_variants, [1110696, 1230237, 1234567]) - lines = list(self.reader.fetch('20', 1110695, 1234567)) - self.assertEquals(len(lines), 3) + def testFetchesFromStartIfStartOnlySpecified(self): + fetched_variants = self.reader.fetch('20', 1110695) + self.assertFetchedExpectedPositions( + fetched_variants, [1110696, 1230237, 1234567]) - def testFetchSite(self): - if not self.run: - return - site = self.reader.fetch('20', 
14370) - self.assertEqual(site.POS, 14370) - site = self.reader.fetch('20', 14369) - assert site is None + def testFetchesAllFromChromIfOnlyChromSpecified(self): + fetched_variants = self.reader.fetch('20') + self.assertFetchedExpectedPositions( + fetched_variants, + [14370, 17330, 1110696, 1230237, 1234567] + ) [email protected](pysam, "test requires installation of PySAM.") +class TestIssue201(unittest.TestCase): + def setUp(self): + # This file contains some non-ASCII characters in a UTF-8 encoding. + # https://github.com/jamescasbon/PyVCF/issues/201 + self.reader = vcf.Reader(fh('issue-201.vcf.gz', 'rb'), + encoding='utf-8') + + def testIterate(self): + for record in self.reader: + # Should not raise decoding errors. + pass + + def testFetch(self): + for record in self.reader.fetch(chrom='17'): + # Should not raise decoding errors. + pass class TestOpenMethods(unittest.TestCase): @@ -870,12 +1297,61 @@ class TestOpenMethods(unittest.TestCase): self.assertEqual(self.samples, r.samples) +class TestSampleFilter(unittest.TestCase): + @unittest.skipUnless(IS_PYTHON2, "test broken for Python 3") + def testCLIListSamples(self): + proc = subprocess.Popen('python scripts/vcf_sample_filter.py vcf/test/example-4.1.vcf', shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE) + out, err = proc.communicate() + self.assertEqual(proc.returncode, 0) + self.assertFalse(err) + expected_out = ['Samples:', '0: NA00001', '1: NA00002', '2: NA00003'] + self.assertEqual(out.splitlines(), expected_out) + + @unittest.skipUnless(IS_PYTHON2, "test broken for Python 3") + def testCLIWithFilter(self): + proc = subprocess.Popen('python scripts/vcf_sample_filter.py vcf/test/example-4.1.vcf -f 1,2 --quiet', shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE) + out, err = proc.communicate() + self.assertEqual(proc.returncode, 0) + self.assertTrue(out) + self.assertFalse(err) + buf = StringIO() + buf.write(out) + buf.seek(0) + #print(buf.getvalue()) + reader = vcf.Reader(buf) + 
self.assertEqual(reader.samples, ['NA00001']) + rec = next(reader) + self.assertEqual(len(rec.samples), 1) + + @unittest.skipUnless(IS_NOT_PYPY, "test broken for PyPy") + def testSampleFilterModule(self): + # init filter with filename, get list of samples + filt = vcf.SampleFilter('vcf/test/example-4.1.vcf') + self.assertEqual(filt.samples, ['NA00001', 'NA00002', 'NA00003']) + # set filter, check which samples will be kept + filtered = filt.set_filters(filters="0", invert=True) + self.assertEqual(filtered, ['NA00001']) + # write filtered file to StringIO + buf = StringIO() + filt.write(buf) + buf.seek(0) + #print(buf.getvalue()) + # undo monkey patch by destroying instance + del filt + self.assertTrue('sample_filter' not in dir(vcf.Reader)) + # read output + reader = vcf.Reader(buf) + self.assertEqual(reader.samples, ['NA00001']) + rec = next(reader) + self.assertEqual(len(rec.samples), 1) + + class TestFilter(unittest.TestCase): + @unittest.skip("test currently broken") def testApplyFilter(self): # FIXME: broken with distribute - return s, out = commands.getstatusoutput('python scripts/vcf_filter.py --site-quality 30 test/example-4.0.vcf sq') #print(out) self.assertEqual(s, 0) @@ -903,9 +1379,9 @@ class TestFilter(unittest.TestCase): self.assertEqual(n, 2) + @unittest.skip("test currently broken") def testApplyMultipleFilters(self): # FIXME: broken with distribute - return s, out = commands.getstatusoutput('python scripts/vcf_filter.py --site-quality 30 ' '--genotype-quality 50 test/example-4.0.vcf sq mgq') self.assertEqual(s, 0) @@ -925,7 +1401,7 @@ class TestRegression(unittest.TestCase): def test_issue_16(self): reader = vcf.Reader(fh('issue-16.vcf')) - n = reader.next() + n = next(reader) assert n.QUAL == None def test_null_mono(self): @@ -940,7 +1416,7 @@ class TestRegression(unittest.TestCase): out.seek(0) print(out.getvalue()) p2 = vcf.Reader(out) - rec = p2.next() + rec = next(p2) assert rec.samples @@ -1014,26 +1490,105 @@ class 
TestGATKMeta(unittest.TestCase): self.assertEqual(reader.metadata['GATKCommandLine'][1]['CommandLineOptions'], '"analysis_type=VariantAnnotator annotation=[HomopolymerRun, VariantType, TandemRepeatAnnotator]"') + +class TestUncalledGenotypes(unittest.TestCase): + """Test the handling of uncalled (., ./.) sample genotypes.""" + + def test_read_uncalled(self): + """Test that uncalled genotypes are properly read into + gt_nums, gt_bases, ploidity, and gt_alleles properties + of _Call objects. For uncalled _Call objects: + + - gt_nums should be None + - gt_bases should be None + - ploidity should match the input ploidity + - gt_alleles should be a list of None's with length + matching the ploidity""" + + reader = vcf.Reader(fh('uncalled_genotypes.vcf')) + for var in reader: + gt_bases = [s.gt_bases for s in var.samples] + gt_nums = [s.gt_nums for s in var.samples] + ploidity = [s.ploidity for s in var.samples] + gt_alleles = [s.gt_alleles for s in var.samples] + + if var.POS == 14370: + self.assertEqual(['0|0', None, '1/1'], gt_nums) + self.assertEqual(['G|G', None, 'A/A'], gt_bases) + self.assertEqual([2,2,2], ploidity) + self.assertEqual([['0','0'], [None,None], ['1','1']], gt_alleles) + elif var.POS == 17330: + self.assertEqual([None, '0|1', '0/0'], gt_nums) + self.assertEqual([None, 'T|A', 'T/T'], gt_bases) + self.assertEqual([3,2,2], ploidity) + self.assertEqual([[None,None,None], ['0','1'], ['0','0']], gt_alleles) + elif var.POS == 1234567: + self.assertEqual(['0/1', '0/2', None], gt_nums) + self.assertEqual(['GTC/G', 'GTC/GTCT', None], gt_bases) + self.assertEqual([2,2,1], ploidity) + self.assertEqual([['0','1'], ['0','2'], [None]], gt_alleles) + reader._reader.close() + + + def test_write_uncalled(self): + """Test that uncalled genotypes are written just as + they were read in the input file.""" + + reader = vcf.Reader(fh('uncalled_genotypes.vcf')) + + # Write all reader records to a stream. 
+ out = StringIO() + writer = vcf.Writer(out, reader, lineterminator='\n') + for record in reader: + writer.write_record(record) + reader._reader.close() + + + # Compare the written stream to the input reader line-by-line. + out.seek(0) + out_lines = out.getvalue().split('\n') + in_file = fh('uncalled_genotypes.vcf') + in_lines = [l.rstrip('\n') for l in in_file] + in_file.close() + for (in_line, out_line) in zip(in_lines, out_lines): + self.assertEqual(in_line,out_line) + +class TestStrelka(unittest.TestCase): + + def test_strelka(self): + reader = vcf.Reader(fh('strelka.vcf')) + n = next(reader) + assert n is not None + + suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestVcfSpecs)) suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestGatkOutput)) suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestFreebayesOutput)) suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestSamtoolsOutput)) suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestBcfToolsOutput)) +suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestIssue214)) suite.addTests(unittest.TestLoader().loadTestsFromTestCase(Test1kg)) suite.addTests(unittest.TestLoader().loadTestsFromTestCase(Test1kgSites)) suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestGoNL)) +suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestStringAsFlag)) suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestInfoOrder)) suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestInfoTypeCharacter)) +suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestParseMetaLine)) suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestGatkOutputWriter)) suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestBcfToolsOutputWriter)) suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestWriterDictionaryMeta)) suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestSamplesSpace)) 
+suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestMetadataWhitespace)) suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestMixedFiltering)) suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestRecord)) suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestCall)) -suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestTabix)) +suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestFetch)) +suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestIssue201)) suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestOpenMethods)) +suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestSampleFilter)) suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestFilter)) suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestRegression)) suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestUtils)) suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestGATKMeta)) +suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestUncalledGenotypes)) +suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestStrelka)) diff --git a/vcf/utils.py b/vcf/utils.py index 456e5fa..2881dc2 100644 --- a/vcf/utils.py +++ b/vcf/utils.py @@ -28,7 +28,7 @@ def walk_together(*readers, **kwargs): nexts = [] for reader in readers: try: - nexts.append(reader.next()) + nexts.append(next(reader)) except StopIteration: nexts.append(None) -- Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/python-pyvcf.git _______________________________________________ debian-med-commit mailing list [email protected] http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/debian-med-commit
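
A note on the most pervasive change in this diff: every `reader.next()` became `next(reader)` (see the `vcf/utils.py` hunk and the many test hunks above). The `.next()` method exists only under Python 2, while the builtin `next()` works on any iterator under both Python 2 and 3. The pattern `walk_together` uses — advance each reader, falling back to `None` on exhaustion — can be sketched with plain iterators; `first_or_none` is an illustrative helper, not part of the PyVCF API:

```python
def first_or_none(iterator):
    """Advance *iterator* one step, or return None when it is exhausted.

    Mirrors the fixed hunk in vcf/utils.py: the builtin next() is
    portable across Python 2 and 3, whereas the .next() method it
    replaces exists only in Python 2.
    """
    try:
        return next(iterator)
    except StopIteration:
        return None

readers = [iter([1, 2]), iter([])]
nexts = [first_or_none(r) for r in readers]
print(nexts)  # [1, None]
```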
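
The rewritten `TestFetch` cases above pin down the interval semantics of the pysam-backed `fetch(chrom, start, end)`: the window is 0-based and half-open `[start, end)`, while VCF `POS` is 1-based, so a record at `POS` p occupies the single zero-based index p − 1. A minimal predicate (an illustrative sketch, not a PyVCF function) consistent with those expectations:

```python
def pos_in_fetch_window(pos, start, end):
    """True if a 1-based VCF POS falls inside a pysam-style
    0-based, half-open [start, end) fetch window."""
    return start <= pos - 1 < end

# Consistent with the TestFetch expectations in the diff above
# (records at POS 14370 and 17330):
print(pos_in_fetch_window(14370, 14370, 17329))  # False: window opens past the record
print(pos_in_fetch_window(14370, 14369, 14370))  # True
print(pos_in_fetch_window(14370, 14369, 14369))  # False: zero-length window
```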

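The new `_Record` coordinate tests encode a 0-based, half-open convention: `start`/`end` always span the REF allele, while `affected_start`/`affected_end` trim, per ALT allele, the leading context bases that allele shares with REF, then take the union across alleles (so a pure insertion affects a zero-length region after the shared base, and a `None` ALT affects the whole REF span). A simplified reimplementation that reproduces the tested cases — illustrative only, not PyVCF's actual `model.py` code, which also handles breakends and symbolic alleles:

```python
def affected_span(pos, ref, alts):
    """Record-level (affected_start, affected_end) for a 1-based POS,
    a REF string, and a list of ALT strings (None = monomorphic ref)."""
    start, end = pos - 1, pos - 1 + len(ref)  # 0-based, half-open REF span
    spans = []
    for alt in alts:
        if alt is None:
            spans.append((start, end))  # no variant: whole REF span affected
            continue
        plen = 0  # length of the leading context shared between REF and ALT
        while plen < min(len(ref), len(alt)) and ref[plen] == alt[plen]:
            plen += 1
        spans.append((start + plen, end))
    # Union across alleles, as in the mixed SNP/indel test cases.
    return (min(s for s, _ in spans), max(e for _, e in spans))

print(affected_span(10, 'C', ['CTA']))   # (10, 10): insertion after the shared base
print(affected_span(10, 'CTA', ['C']))   # (10, 12): deletion keeps the anchor base
```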