This is an automated email from the git hooks/post-receive script.

afif pushed a commit to branch master
in repository kineticstools.

commit 737e59669f40fa6b38ad86de5f5416419fc463ee
Author: Afif Elghraoui <[email protected]>
Date:   Wed Nov 30 23:22:35 2016 -0800

    Imported Upstream version 0.6.1+dfsg
---
 doc/manual.rst                            | 116 +++++++++++++++---------------
 kineticsTools/ReferenceUtils.py           |   3 +-
 kineticsTools/ResultWriter.py             |  63 ++++++++--------
 kineticsTools/ipdSummary.py               |  23 ++++--
 kineticsTools/summarizeModifications.py   |   4 +-
 kineticsTools/tasks/gather_kinetics_h5.py |   4 +-
 setup.py                                  |   2 +-
 test/cram/version.t                       |   2 +-
 test/test_gather_h5.py                    |  48 ++++++++++---
 test/test_outputs.py                      |  13 ++--
 test/test_tool_contract.py                |  10 +--
 11 files changed, 162 insertions(+), 126 deletions(-)

diff --git a/doc/manual.rst b/doc/manual.rst
index c9b2c21..5c4105b 100644
--- a/doc/manual.rst
+++ b/doc/manual.rst
@@ -41,7 +41,7 @@ Studies of the relationship between IPD and sequence context 
reveal that most of
 Filtering and Trimming
 ----------------------
 
-kineticsTools uses the Mapping QV generated by BLASR and stored in the cmp.h5 
file to ignore reads that aren't confidently mapped.  The default minimum 
Mapping QV required is 10, implying that BLASR has :math:`90\%` confidence that 
the read is correctly mapped. Because of the range of readlengths inherent in 
PacBio dataThis can be changed in using the --mapQvThreshold command line 
argument, or via the SMRTPortal configuration dialog for Modification 
Detection. 
+kineticsTools uses the Mapping QV generated by BLASR and stored in the cmp.h5 
or BAM file (or AlignmentSet) to ignore reads that aren't confidently mapped.  
The default minimum Mapping QV required is 10, implying that BLASR has 
:math:`90\%` confidence that the read is correctly mapped. Because of the range 
of readlengths inherent in PacBio data. This can be changed using the 
--mapQvThreshold command line argument, or via the SMRTPortal configuration 
dialog for Modification Detection. 
 
 There are a few features of PacBio data that require special attention in 
order to achieve good modification detection performance.
 kineticsTools inspects the alignment between the observed bases and the 
reference sequence -- in order for an IPD measurement to be included in the 
analysis, the PacBio read sequence must match the reference sequence for 
:math:`k` around the cognate base. In the current module :math:`k=1`
@@ -55,19 +55,35 @@ Statistical Testing
 -------------------
 We test the hypothesis that IPDs observed at a particular locus in the sample 
have longer means than IPDs observed at the same locus in unmodified DNA.  If 
we have generated a Whole Genome Amplified dataset, which removes DNA 
modifications, we use a case-control, two-sample t-test.  This tool also 
provides a pre-calibrated 'synthetic control' model which predicts the 
unmodified IPD, given a 12 base sequence context. In the synthetic control case 
we use a one-sample t-test, with an adju [...]
 
+
+=============
+Example Usage
+=============
+
+Basic use with BAM input, GFF+HDF5 output::
+
+  ipdSummary aligned.bam --reference ref.fasta --identify m6A,m4C --gff 
basemods.gff --csv_h5 kinetics.h5
+
+With cmp.h5 input, methyl fraction calculation and GFF+CSV output::
+
+  ipdSummary aligned.cmp.h5 --reference ref.fasta --identify m6A,m4C 
--methylFraction --gff basemods.gff --csv kinetics.csv
+
+
 ======
 Inputs
 ======
 
-aligned_reads.cmp.h5
---------------------
+Aligned Reads
+-------------
 
-A standard cmp.h5 file contain alignments and IPD information supplies the 
kinetic data used to perform modification detection.  The standard cmp.h5 file 
of a SMRTportal jobs is data/aligned_read.cmp.h5.
+A standard PacBio alignment file - either AlignmentSet XML, BAM, or cmp.h5 -
+containing alignments and IPD information supplies the kinetic data used to 
perform modification detection.  The standard cmp.h5 file of a SMRTportal job 
is data/aligned_read.cmp.h5.
 
 Reference Sequence
 ------------------
 
-The tool requires the reference sequence used to perform alignments.  
Currently this must be supplied via the path to a SMRTportal reference 
repository entry.
+The tool requires the reference sequence used to perform alignments.  This can
+be either a FASTA file or a ReferenceSet XML.
 
 =======
 Outputs
@@ -77,10 +93,45 @@ The modification detection tool provides results in a 
variety of formats suitabl
quick reference, and consumption by visualization tools such as PacBio 
SMRTView.
 Results are generally indexed by reference position and reference strand.  In 
all cases the strand value refers to the strand carrying the modification in 
DNA sample. Remember that the kinetic effect of the modification is observed in 
read sequences aligning to the opposite strand. So reads aligning to the 
positive strand carry information about modification on the negative strand and 
vice versa, but in this toolkit we always report the strand containing the 
putative modification.
 
+The following output options are available:
+
+  - ``--gff FILENAME``: GFF format
+  - ``--csv FILENAME``: comma-separated value format
+  - ``--csv_h5 FILENAME``: compact binary equivalent of CSV in HDF5 format
+  - ``--bigwig FILENAME``: BigWig file (mostly only useful for SMRTView)
+
+
+modifications.gff
+-----------------
+The modifications.gff is compliant with the GFF Version 3 specification 
(http://www.sequenceontology.org/gff3.shtml). Each template position / strand 
pair whose p-value exceeds the pvalue threshold appears as a row. The template 
position is 1-based, per the GFF spec.  The strand column refers to the strand 
carrying the detected modification, which is the opposite strand from those 
used to detect the modification. The GFF confidence column is a 
Phred-transformed pvalue of detection.
+
+**Note on genome browser compatibility**
+
+The modifications.gff file will not work directly with most genome browsers.  
You will likely need to make a copy of the GFF file and convert the _seqid_ 
columns from the generic 'ref0000x' names generated by PacBio, to the FASTA 
headers present in the original reference FASTA file.  The mapping table is 
written in the header of the modifications.gff file in  ``#sequence-header`` 
tags.  This issue will be resolved in the 1.4 release of kineticsTools.
+
+The auxiliary data column of the GFF file contains other statistics which may 
be useful for downstream analysis or filtering.  In particular the coverage level 
of the reads used to make the call, and +/- 20bp sequence context surrounding 
the site.
+
+================  ===========
+Column      Description
+================  ===========
+seqid     Fasta contig name
+source            Name of tool -- 'kinModCall'
+type                    Modification type -- in identification mode this will 
be m6A, m4C, or m5C for identified bases, or the generic tag 'modified_base' if 
a kinetic event was detected that does not match a known modification signature
+start                   Modification position on contig
+end                     Modification position on contig
+score                   Phred transformed p-value of detection - this is the 
single-site detection p-value
+strand                  Sample strand containing modification
+phase                   Not applicable
+attributes              Extra fields relevant to base mods. IPDRatio is 
traditional IPDRatio, context is the reference sequence -20bp to +20bp around 
the modification, and coverage level is the number of IPD observations used 
after Mapping QV filtering and accuracy filtering. If the row results from an 
identified modification we also include an identificationQv tag with the value from 
the modification identification procedure. identificationQv is the 
phred-transformed probability of an incorre [...]
+================  ===========
+
+
 modifications.csv
 -----------------
 The modifications.csv file contains one row for each (reference position, 
strand) pair that appeared in the dataset with coverage at least x.
-x defaults to 3, but is configurable with '--minCoverage' flag to 
ipdSummary.py. The reference position index is 1-based for compatibility with 
the gff file the R environment.  
+x defaults to 3, but is configurable with '--minCoverage' flag to 
ipdSummary.py. The reference position index is 1-based for compatibility with 
the gff file and the R environment.  Note that this output type scales poorly and 
is not
+recommended for large genomes; the HDF5 output should perform much better in
+these cases.
 
 Output columns
 --------------
@@ -125,56 +176,3 @@ coverage                mean of case and control coverage
 controlCoverage         count of valid control IPDs at this position (see 
Filtering section for details)
 caseCoverage            count of valid case IPDs at this position (see 
Filtering section for details)
 ================       ===========
-
-
-
-modifications.gff
------------------
-The modifications.gff is compliant with the GFF Version 3 specification 
(http://www.sequenceontology.org/gff3.shtml). Each template position / strand 
pair whose p-value exceeds the pvalue threshold appears as a row. The template 
position is 1-based, per the GFF spec.  The strand column refers to the strand 
carrying the detected modification, which is the opposite strand from those 
used to detect the modification. The GFF confidence column is a 
Phred-transformed pvalue of detection.
-
-**Note on genome browser compatibility**
-
-The modifications.gff file will not work directly with most genome browsers.  
You will likely need to make a copy of the GFF file and convert the _seqid_ 
columns from the generic 'ref0000x' names generated by PacBio, to the FASTA 
headers present in the original reference FASTA file.  The mapping table is 
written in the header of the modifications.gff file in  ``#sequence-header`` 
tags.  This issue will be resolved in the 1.4 release of kineticsTools.
-
-The auxiliary data column of the GFF file contains other statistics which may 
be useful downstream analysis or filtering.  In particular the coverage level 
of the reads used to make the call, and +/- 20bp sequence context surrounding 
the site.
-
-
-================       ===========
-Column                 Description
-================       ===========
-seqid                  Fasta contig name
-source                 Name of tool -- 'kinModCall'
-type                    Modification type -- in identification mode this will 
be m6A, m4C, or m5C for identified bases, or the generic tag 'modified_base' if 
a kinetic event was detected that does not match a known modification signature
-start                   Modification position on contig
-end                     Modification position on contig
-score                   Phred transformed p-value of detection - this is the 
single-site detection p-value
-strand                  Sample strand containing modification
-phase                   Not applicable
-attributes              Extra fields relevant to base mods. IPDRatio is 
traditional IPDRatio, context is the reference sequence -20bp to +20bp around 
the modification, and coverage level is the number of IPD observations used 
after Mapping QV filtering and accuracy filtering. If the row results from an 
identified modification we also include an identificationQv tag with the from 
the modification identification procedure. identificationQv is the 
phred-transformed probability of an incorre [...]
-================       ===========
-
-motifs.gff
-----------
-If the Motif Finder tool is run, it will generate motifs.gff, which a 
reprocessed version of modifications.gff with the following changes. If a 
detected modification occurs on a motif detected by the motif finder, the 
modification is annotated with motif data. An attribute 'motif' is added 
containing the motif string, and an attribute 'id' is added containing the 
motif id, which is the motif string for unpaired motifs or 
'motifString1/motifString2' for paired motifs. If a motif instance  [...]
-
-
-motif_summary.csv
------------------
-
-If the Motif Finder tool is run, motif_summary.csv is generated, summarizing 
the modified motifs discovered by the tool. The CSV contains one row per 
detected motif, with the following columns
-
-==================          ===========
-Column                      Description
-==================          =========== 
-motifString                 Detected motif sequence
-centerPos                   Position in motif of modification (0-based)
-fraction                    Fraction of instances of this motif with 
modification QV above the QV threshold
-nDetected                   Number of instances of this motif with above 
threshold
-nGenome                     Number of instances of this motif in reference 
sequence
-groupTag                    A string idetifying the motif grouping. For paired 
motifs this is "<motifString1>/<motifString2>", For unpaired motifs this equals 
motifString
-partnerMotifString          motifString of paired motif (motif with 
reverse-complementary motifString)
-meanScore                   Mean Modification Qv of detected instances
-meanIpdRatio                Mean IPD ratio of detected instances
-meanCoverage                Mean coverage of detected instances
-objectiveScore              Objective score of this motif in the motif finder 
algorithm
-==================          =========== 
diff --git a/kineticsTools/ReferenceUtils.py b/kineticsTools/ReferenceUtils.py
index e63758c..ae8a5ab 100755
--- a/kineticsTools/ReferenceUtils.py
+++ b/kineticsTools/ReferenceUtils.py
@@ -75,7 +75,8 @@ class ReferenceUtils():
 
         # Mark each contig with it's ID from the cmp.h5 - match them up using 
MD5s
         for x in alignmentSet.referenceInfoTable:
-            contigDict[x.FullName].cmph5ID = x.ID
+            if x.FullName in contigDict:
+                contigDict[x.FullName].cmph5ID = x.ID
 
         return contigs
 
diff --git a/kineticsTools/ResultWriter.py b/kineticsTools/ResultWriter.py
index 0671fcd..dbcdedd 100755
--- a/kineticsTools/ResultWriter.py
+++ b/kineticsTools/ResultWriter.py
@@ -42,7 +42,7 @@ import sys
 from kineticsTools.pipelineTools import consumer
 import math
 
-
+DEFAULT_NCHUNKS = 256
 # Labels for modified fraction:
 FRAC = 'frac'
 FRAClow = 'fracLow'
@@ -359,24 +359,23 @@ class KineticsWriter(ResultCollectorProcess):
         y = [int(ref.Length) for ref in self.refInfo]
         dataLength = 2 * sum(y)
         y.append(8192)
-        chunkSize = min(dataLength, 8192 * 2)
-        # print "dataLength = ", dataLength, " chunkSize = ", chunkSize, " y = 
", y
-
-        refIdDataset = grp.create_dataset('refId', (dataLength,), dtype="u4", 
compression="gzip", chunks=(chunkSize,), compression_opts=2)
-        tplDataset = grp.create_dataset('tpl', (dataLength,), dtype="u4", 
compression="gzip", chunks=(chunkSize,), compression_opts=2)
-        strandDataset = grp.create_dataset('strand', (dataLength,), 
dtype="u1", compression="gzip", chunks=(chunkSize,), compression_opts=2)
-        baseDataset = grp.create_dataset('base', (dataLength,), dtype="a1", 
compression="gzip", chunks=(chunkSize,), compression_opts=2)
-        scoreDataset = grp.create_dataset('score', (dataLength,), dtype="u4", 
compression="gzip", chunks=(chunkSize,), compression_opts=2)
-        tMeanDataset = grp.create_dataset('tMean', (dataLength,), dtype="f4", 
compression="gzip", chunks=(chunkSize,), compression_opts=2)
-        tErrDataset = grp.create_dataset('tErr', (dataLength,), dtype="f4", 
compression="gzip", chunks=(chunkSize,), compression_opts=2)
-        modelPredictionDataset = grp.create_dataset('modelPrediction', 
(dataLength,), dtype="f4", compression="gzip", chunks=(chunkSize,), 
compression_opts=2)
-        ipdRatioDataset = grp.create_dataset('ipdRatio', (dataLength,), 
dtype="f4", compression="gzip", chunks=(chunkSize,), compression_opts=2)
-        coverageDataset = grp.create_dataset('coverage', (dataLength,), 
dtype="u4", compression="gzip", chunks=(chunkSize,), compression_opts=2)
+        nChunks = min(dataLength, DEFAULT_NCHUNKS)
+
+        refIdDataset = grp.create_dataset('refId', (dataLength,), dtype="u4", 
compression="gzip", chunks=(nChunks,), compression_opts=2)
+        tplDataset = grp.create_dataset('tpl', (dataLength,), dtype="u4", 
compression="gzip", chunks=(nChunks,), compression_opts=2)
+        strandDataset = grp.create_dataset('strand', (dataLength,), 
dtype="u1", compression="gzip", chunks=(nChunks,), compression_opts=2)
+        baseDataset = grp.create_dataset('base', (dataLength,), dtype="a1", 
compression="gzip", chunks=(nChunks,), compression_opts=2)
+        scoreDataset = grp.create_dataset('score', (dataLength,), dtype="u4", 
compression="gzip", chunks=(nChunks,), compression_opts=2)
+        tMeanDataset = grp.create_dataset('tMean', (dataLength,), dtype="f4", 
compression="gzip", chunks=(nChunks,), compression_opts=2)
+        tErrDataset = grp.create_dataset('tErr', (dataLength,), dtype="f4", 
compression="gzip", chunks=(nChunks,), compression_opts=2)
+        modelPredictionDataset = grp.create_dataset('modelPrediction', 
(dataLength,), dtype="f4", compression="gzip", chunks=(nChunks,), 
compression_opts=2)
+        ipdRatioDataset = grp.create_dataset('ipdRatio', (dataLength,), 
dtype="f4", compression="gzip", chunks=(nChunks,), compression_opts=2)
+        coverageDataset = grp.create_dataset('coverage', (dataLength,), 
dtype="u4", compression="gzip", chunks=(nChunks,), compression_opts=2)
 
         if self.options.methylFraction:
-            fracDataset = grp.create_dataset(FRAC, (dataLength,), dtype="f4", 
compression="gzip", chunks=(chunkSize,), compression_opts=2)
-            fracLowDataset = grp.create_dataset(FRAClow, (dataLength,), 
dtype="f4", compression="gzip", chunks=(chunkSize,), compression_opts=2)
-            fracUpDataset = grp.create_dataset(FRACup, (dataLength,), 
dtype="f4", compression="gzip", chunks=(chunkSize,), compression_opts=2)
+            fracDataset = grp.create_dataset(FRAC, (dataLength,), dtype="f4", 
compression="gzip", chunks=(nChunks,), compression_opts=2)
+            fracLowDataset = grp.create_dataset(FRAClow, (dataLength,), 
dtype="f4", compression="gzip", chunks=(nChunks,), compression_opts=2)
+            fracUpDataset = grp.create_dataset(FRACup, (dataLength,), 
dtype="f4", compression="gzip", chunks=(nChunks,), compression_opts=2)
 
         try:
             while True:
@@ -495,25 +494,25 @@ class KineticsWriter(ResultCollectorProcess):
         for ref in self.refInfo:
             # Each reference group will house a collection of datasets:
             dataLength = ref.Length * 2
-            chunkSize = min(dataLength, 8192)
+            nChunks = min(dataLength, DEFAULT_NCHUNKS)
 
             # Create a group for each reference:
             grp = f.create_group(str(ref.Name))
 
-            ds = grp.create_dataset('tpl', (dataLength,), dtype="u4", 
compression="gzip", chunks=(chunkSize,))
-            ds = grp.create_dataset('strand', (dataLength,), dtype="u1", 
compression="gzip", chunks=(chunkSize,))
-            ds = grp.create_dataset('base', (dataLength,), dtype="a1", 
compression="gzip", chunks=(chunkSize,))
-            ds = grp.create_dataset('score', (dataLength,), dtype="u4", 
compression="gzip", chunks=(chunkSize,))
-            ds = grp.create_dataset('tMean', (dataLength,), dtype="f4", 
compression="gzip", chunks=(chunkSize,))
-            ds = grp.create_dataset('tErr', (dataLength,), dtype="f4", 
compression="gzip", chunks=(chunkSize,))
-            ds = grp.create_dataset('modelPrediction', (dataLength,), 
dtype="f4", compression="gzip", chunks=(chunkSize,))
-            ds = grp.create_dataset('ipdRatio', (dataLength,), dtype="f4", 
compression="gzip", chunks=(chunkSize,))
-            ds = grp.create_dataset('coverage', (dataLength,), dtype="u4", 
compression="gzip", chunks=(chunkSize,))
+            ds = grp.create_dataset('tpl', (dataLength,), dtype="u4", 
compression="gzip", chunks=(nChunks,))
+            ds = grp.create_dataset('strand', (dataLength,), dtype="u1", 
compression="gzip", chunks=(nChunks,))
+            ds = grp.create_dataset('base', (dataLength,), dtype="a1", 
compression="gzip", chunks=(nChunks,))
+            ds = grp.create_dataset('score', (dataLength,), dtype="u4", 
compression="gzip", chunks=(nChunks,))
+            ds = grp.create_dataset('tMean', (dataLength,), dtype="f4", 
compression="gzip", chunks=(nChunks,))
+            ds = grp.create_dataset('tErr', (dataLength,), dtype="f4", 
compression="gzip", chunks=(nChunks,))
+            ds = grp.create_dataset('modelPrediction', (dataLength,), 
dtype="f4", compression="gzip", chunks=(nChunks,))
+            ds = grp.create_dataset('ipdRatio', (dataLength,), dtype="f4", 
compression="gzip", chunks=(nChunks,))
+            ds = grp.create_dataset('coverage', (dataLength,), dtype="u4", 
compression="gzip", chunks=(nChunks,))
 
             if self.options.methylFraction:
-                ds = grp.create_dataset(FRAC, (dataLength,), dtype="f4", 
compression="gzip", chunks=(chunkSize,))
-                ds = grp.create_dataset(FRAClow, (dataLength,), dtype="f4", 
compression="gzip", chunks=(chunkSize,))
-                ds = grp.create_dataset(FRACup, (dataLength,), dtype="f4", 
compression="gzip", chunks=(chunkSize,))
+                ds = grp.create_dataset(FRAC, (dataLength,), dtype="f4", 
compression="gzip", chunks=(nChunks,))
+                ds = grp.create_dataset(FRAClow, (dataLength,), dtype="f4", 
compression="gzip", chunks=(nChunks,))
+                ds = grp.create_dataset(FRACup, (dataLength,), dtype="f4", 
compression="gzip", chunks=(nChunks,))
 
             # Maintain a dictionary of group paths?
             dsDict[ref.ID] = grp
@@ -633,12 +632,12 @@ class KineticsWriter(ResultCollectorProcess):
             # FIXME -- create with good chunk parameters, activate compression
             log.info("Creating IpdRatio dataset w/ name: %s, Size: %d" % 
(str(ref.Name), ref.Length))
 
-            chunkSize = min(ref.Length, 8192)
+            nChunks = min(ref.Length, DEFAULT_NCHUNKS)
 
             ds = f.create_dataset(str(ref.Name), (ref.Length,),
                                   dtype="u4",
                                   compression='gzip',
-                                  chunks=(chunkSize,))
+                                  chunks=(nChunks,))
 
             dsDict[ref.ID] = ds
 
diff --git a/kineticsTools/ipdSummary.py b/kineticsTools/ipdSummary.py
index 6bfd088..c93e369 100755
--- a/kineticsTools/ipdSummary.py
+++ b/kineticsTools/ipdSummary.py
@@ -126,16 +126,16 @@ def get_parser():
         type=validateFile, help="Fasta or Reference DataSet")
     # XXX GFF and CSV are "option" for arg parser, not tool contract
     tcp.add_output_file_type(FileTypes.GFF, "gff",
-        name="Modified bases GFF",
-        description="GFF file of modified bases",
+        name="Modifications",
+        description="Summary of analysis results for each kinModCall",
         default_name="basemods")
     tcp.add_output_file_type(FileTypes.BIGWIG, "bigwig",
-        name="BigWig file encoding base IpdRatios",
-        description="Compressed binary format containing the IpdRatios for 
every base (both strands)",
+        name="IPD Ratios",
+        description="BigWig file encoding base IpdRatios",
         default_name="basemods")
     tcp.add_output_file_type(FileTypes.H5, "csv_h5",
-        name="HDF5 file containing per-base information",
-        description="HDF5 equivalent of CSV output",
+        name="Full Kinetics Summary",
+        description="HDF5 file containing per-base information",
         default_name="basemods")
     argp.add_argument("--gff", action="store", default=None,
         help="Output GFF file of modified bases")
@@ -567,6 +567,17 @@ class KineticsToolsRunner(object):
                 sys.exit(1)
 
             majorityChem = 
ReferenceUtils.loadAlignmentChemistry(self.alignments)
+
+            # Temporary solution for Sequel chemistries: we do not
+            # have trained kinetics models in hand yet for Sequel
+            # chemistries.  However we have observed that the P5-C3
+            # training seems to yield fairly good results on Sequel
+            # chemistries to date.  So for the moment, we will use
+            # that model for Sequel data.
+            if majorityChem.startswith("S/"):
+                logging.info("No trained model available yet for Sequel 
chemistries; modeling as P5-C3")
+                majorityChem = "P5-C3"
+
             ipdModel = os.path.join(self.args.paramsPath, majorityChem + ".h5")
             if majorityChem == 'unknown':
                 logging.error("Chemistry cannot be identified---cannot perform 
kinetic analysis")
diff --git a/kineticsTools/summarizeModifications.py 
b/kineticsTools/summarizeModifications.py
index 2f2c669..5581314 100755
--- a/kineticsTools/summarizeModifications.py
+++ b/kineticsTools/summarizeModifications.py
@@ -178,8 +178,8 @@ def get_parser():
         name="GFF file",
         description="Alignment summary GFF")
     p.add_output_file_type(FileTypes.GFF, "gff_out",
-        name="Alignment summary with basemods",
-        description="Modified alignment summary file",
+        name="Coverage and Base Modifications Summary",
+        description="Coverage summary for regions (bins) spanning the 
reference with basemod results for each region",
         default_name="alignment_summary_with_basemods")
     return p
 
diff --git a/kineticsTools/tasks/gather_kinetics_h5.py 
b/kineticsTools/tasks/gather_kinetics_h5.py
index 76703f5..2115846 100644
--- a/kineticsTools/tasks/gather_kinetics_h5.py
+++ b/kineticsTools/tasks/gather_kinetics_h5.py
@@ -15,6 +15,8 @@ from pbcommand.cli import pbparser_runner
 from pbcommand.models import get_gather_pbparser, FileTypes
 from pbcommand.utils import setup_log
 
+from kineticsTools.ResultWriter import DEFAULT_NCHUNKS
+
 log = logging.getLogger(__name__)
 
 
@@ -122,7 +124,7 @@ def gather_kinetics_h5_byref(chunked_files, output_file):
                  s=ref_sizes[ref_name]))
         grp = out.create_group(ref_name)
         dataLength = ref_sizes[ref_name]
-        chunkSize = min(dataLength, 8192)
+        chunkSize = min(dataLength, DEFAULT_NCHUNKS)
         for ds_id, ds_type in ds_types.iteritems():
             log.debug("  dataset {i} ({t})".format(i=ds_id, t=ds_type))
             ds = grp.create_dataset(ds_id, (dataLength,), dtype=ds_type,
diff --git a/setup.py b/setup.py
index e1eb4bf..cf697c8 100755
--- a/setup.py
+++ b/setup.py
@@ -4,7 +4,7 @@ import sys
 
 setup(
     name='kineticsTools',
-    version='0.6.0',
+    version='0.6.1',
     author='Pacific Biosciences',
     author_email='[email protected]',
     license=open('LICENSES.txt').read(),
diff --git a/test/cram/version.t b/test/cram/version.t
index 91ce3dd..bb10400 100644
--- a/test/cram/version.t
+++ b/test/cram/version.t
@@ -1,7 +1,7 @@
 A simple test of the version and help options:
 
   $ ipdSummary --version
-  2.2
+  2.3
 
   $ ipdSummary
   usage: ipdSummary [-h] [--version] [--emit-tool-contract]
diff --git a/test/test_gather_h5.py b/test/test_gather_h5.py
index 954fa26..1a4310f 100644
--- a/test/test_gather_h5.py
+++ b/test/test_gather_h5.py
@@ -1,4 +1,6 @@
 
+# TODO test with more than one simulated reference contig
+
 import unittest
 import tempfile
 
@@ -9,7 +11,8 @@ import pbcommand.testkit.core
 from pbcommand.models import PipelineChunk
 from pbcommand.pb_io.common import write_pipeline_chunks
 
-from kineticsTools.tasks.gather_kinetics_h5 import gather_kinetics_h5
+from kineticsTools.tasks.gather_kinetics_h5 import (gather_kinetics_h5,
+                                                    gather_kinetics_h5_byref)
 
 class SetUpHDF5(object):
     DATALENGTH = 10000
@@ -20,16 +23,24 @@ class SetUpHDF5(object):
     ]
 
     @classmethod
+    def get_base_group_input(cls, f):
+        return f.create_group("chr1")
+
+    def get_base_group_output(self, f):
+        return f[f.keys()[0]]
+
+    @classmethod
     def makeInputs(cls):
         for k, ifn in enumerate(cls.CHUNKED_FILES):
             f = h5py.File(cls.CHUNKED_FILES[k], "w")
-            a = f.create_dataset("base", (cls.DATALENGTH,),
+            g = cls.get_base_group_input(f)
+            a = g.create_dataset("base", (cls.DATALENGTH,),
                                  dtype="a1", compression="gzip",
                                  chunks=(cls.CHUNKSIZE,), compression_opts=2)
-            b = f.create_dataset("score", (cls.DATALENGTH,),
+            b = g.create_dataset("score", (cls.DATALENGTH,),
                                  dtype="u4", compression="gzip",
                                  chunks=(cls.CHUNKSIZE,), compression_opts=2)
-            c = f.create_dataset("tMean", (cls.DATALENGTH,),
+            c = g.create_dataset("tMean", (cls.DATALENGTH,),
                                  dtype="f4", compression="gzip",
                                  chunks=(cls.CHUNKSIZE,), compression_opts=2)
             start = k * (cls.DATALENGTH / 2)
@@ -46,18 +57,35 @@ class TestGatherHDF5(unittest.TestCase, SetUpHDF5):
     def setUpClass(cls):
         cls.makeInputs()
 
+    def _gather(self, ofn):
+        gather_kinetics_h5_byref(self.CHUNKED_FILES, ofn)
+
     def test_gather_kinetics_h5(self):
         ofn = tempfile.NamedTemporaryFile(suffix=".h5").name
-        gather_kinetics_h5(self.CHUNKED_FILES, ofn)
+        self._gather(ofn)
         f = h5py.File(ofn)
-        self.assertTrue(all(f['base'].__array__() == 'A'))
-        self.assertTrue(all(f['score'].__array__() > 0))
-        self.assertEqual(f['score'].__array__().mean(), 3)
-        self.assertTrue(all(f['tMean'].__array__() > 0))
-        d = np.round(f['tMean'].__array__() ** 2).astype("u4")
+        g = self.get_base_group_output(f)
+        self.assertTrue(all(g['base'].__array__() == 'A'))
+        self.assertTrue(all(g['score'].__array__() > 0))
+        self.assertEqual(g['score'].__array__().mean(), 3)
+        self.assertTrue(all(g['tMean'].__array__() > 0))
+        d = np.round(g['tMean'].__array__() ** 2).astype("u4")
         self.assertTrue(all(d == np.array(range(1, self.DATALENGTH+1))))
 
 
+class TestGatherHDF5Flat(TestGatherHDF5):
+
+    @classmethod
+    def get_base_group_input(cls, f):
+        return f
+
+    def get_base_group_output(self, f):
+        return f
+
+    def _gather(self, ofn):
+        gather_kinetics_h5(self.CHUNKED_FILES, ofn)
+
+
 class TestGatherH5ToolContract(pbcommand.testkit.core.PbTestGatherApp, 
SetUpHDF5):
     DRIVER_BASE = "python -m kineticsTools.tasks.gather_kinetics_h5 "
     INPUT_FILES = [tempfile.NamedTemporaryFile(suffix=".json").name]
diff --git a/test/test_outputs.py b/test/test_outputs.py
index 03430de..59565c2 100644
--- a/test/test_outputs.py
+++ b/test/test_outputs.py
@@ -63,22 +63,23 @@ class TestOutputs(unittest.TestCase):
 
     def test_h5_output(self):
         f = h5py.File(self.h5_file)
-        bases = f['base'].__array__()
+        g = f[f.keys()[0]] # just one group in this file
+        bases = g['base'].__array__()
         self.assertEqual(bases[0], "A")
         self.assertEqual(bases[100], "T")
         self.assertEqual(list(bases[0:400] != "").count(True), 400)
-        seqh_fwd = ''.join([f['base'][x*2] for x in range(200)])
+        seqh_fwd = ''.join([g['base'][x*2] for x in range(200)])
         seqc_fwd = ''.join([self.csv_records[x*2][3] for x in range(200)])
         self.assertEqual(seqh_fwd, seqc_fwd)
-        seqh_rev = ''.join([f['base'][(x*2)+1] for x in range(200)])
+        seqh_rev = ''.join([g['base'][(x*2)+1] for x in range(200)])
         seqc_rev = ''.join([self.csv_records[(x*2)+1][3] for x in range(200)])
         self.assertEqual(seqh_rev, seqc_rev)
-        tpl_fwd = [f['tpl'][x*2] for x in range(200)]
+        tpl_fwd = [g['tpl'][x*2] for x in range(200)]
         self.assertEqual(tpl_fwd, range(1, 201))
         f = h5py.File(self.h5_file)
         for i_rec, rec in enumerate(self.csv_records):
-            self.assertEqual(str(f['tpl'][i_rec]), self.csv_records[i_rec][1])
-            self.assertEqual("%.3f"%f['ipdRatio'][i_rec],
+            self.assertEqual(str(g['tpl'][i_rec]), self.csv_records[i_rec][1])
+            self.assertEqual("%.3f"%g['ipdRatio'][i_rec],
                              self.csv_records[i_rec][8])
 
     def test_csv_output(self):
diff --git a/test/test_tool_contract.py b/test/test_tool_contract.py
index 023d585..0f06e51 100755
--- a/test/test_tool_contract.py
+++ b/test/test_tool_contract.py
@@ -21,11 +21,6 @@ REF_DIR = 
"/pbi/dept/secondary/siv/references/Helicobacter_pylori_J99"
 
 class Constants(object):
     N_LINES_GFF = 338
-    N_LINES_CSV = 13357
-    INITIAL_LINES_CSV = """\
-refName,tpl,strand,base,score,tMean,tErr,modelPrediction,ipdRatio,coverage
-"gi|12057207|gb|AE001439.1|",1,0,A,10,2.387,0.464,1.710,1.396,29
-"gi|12057207|gb|AE001439.1|",1,1,T,1,0.492,0.075,0.602,0.817,57"""
     INITIAL_LINES_GFF = """\
 
gi|12057207|gb|AE001439.1|\tkinModCall\tm6A\t35\t35\t187\t-\t.\tcoverage=118;context=TTTAAGGGCGTTTTATGCCTAAATTTAAAAAATGATGCTGT;IPDRatio=5.68;identificationQv=196
 
gi|12057207|gb|AE001439.1|\tkinModCall\tm4C\t60\t60\t49\t-\t.\tcoverage=112;context=AAAAAGCTCGCTCAAAAACCCTTGATTTAAGGGCGTTTTAT;IPDRatio=2.58;identificationQv=33
@@ -64,8 +59,9 @@ class TestIpdSummary(pbcommand.testkit.PbTestApp):
             return "\n".join(out)
         self.assertEqual(head2(gff_file, 3), Constants.INITIAL_LINES_GFF)
         f = h5py.File(rtc.task.output_files[2])
-        seqh_fwd = ''.join([f['base'][x*2] for x in range(5000)])
-        seqh_rev = ''.join([f['base'][(x*2)+1] for x in range(5000)])
+        g = f[f.keys()[0]]
+        seqh_fwd = ''.join([g['base'][x*2] for x in range(5000)])
+        seqh_rev = ''.join([g['base'][(x*2)+1] for x in range(5000)])
         self.assertEqual(len(seqh_fwd), 5000)
         self.assertEqual(len(seqh_rev), 5000)
 

-- 
Alioth's /usr/local/bin/git-commit-notice on 
/srv/git.debian.org/git/debian-med/kineticstools.git

_______________________________________________
debian-med-commit mailing list
[email protected]
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/debian-med-commit

Reply via email to