This is an automated email from the git hooks/post-receive script. afif pushed a commit to branch master in repository pbgenomicconsensus.
commit 5e744427abd554ff0ebf50310f91835a6212d83f Author: Afif Elghraoui <[email protected]> Date: Fri Jul 8 08:17:43 2016 -0700 Redo and reorganize manpages * Drop rst patches-- copy doc sources and maintain them independently * Add helper rules to d/rules --- debian/control | 2 +- debian/man/arrow.1 | 22 ++ debian/man/genomicconsensus.7.md | 396 ++++++++++++++++++++++++++++++++++ debian/man/pbgff.5.md | 164 ++++++++++++++ debian/man/plurality.1 | 7 +- debian/man/poa.1 | 22 ++ debian/man/quiver.1 | 22 ++ debian/man/variantCaller.1.md | 254 ++++++++++++++++++++++ debian/patches/manpages.patch | 432 ------------------------------------- debian/patches/series | 1 - debian/pbgenomicconsensus.manpages | 4 +- debian/rules | 38 +++- 12 files changed, 920 insertions(+), 444 deletions(-) diff --git a/debian/control b/debian/control index 33f40c0..bcef98c 100644 --- a/debian/control +++ b/debian/control @@ -6,7 +6,7 @@ Uploaders: Afif Elghraoui <[email protected]> Build-Depends: debhelper (>= 9), dh-python, - python-docutils, + pandoc, python-pkg-resources, python-all, python-setuptools, diff --git a/debian/man/arrow.1 b/debian/man/arrow.1 new file mode 100644 index 0000000..9ef4244 --- /dev/null +++ b/debian/man/arrow.1 @@ -0,0 +1,22 @@ +.TH ARROW 1 + +.SH NAME + +arrow \- HMM-based consensus caller for Pacific Biosciences data + +.SH DESCRIPTION + +See +.BR genomicconsensus (7) +for a more detailed description of \f[B]arrow\f[P]. + +.SH OPTIONS + +.TP +.B \-h, \-\-help +Show summary of options. + +.SH SEE ALSO +.BR genomicconsensus (7) +.BR variantCaller (1) +.BR blasr (1) diff --git a/debian/man/genomicconsensus.7.md b/debian/man/genomicconsensus.7.md new file mode 100644 index 0000000..a544112 --- /dev/null +++ b/debian/man/genomicconsensus.7.md @@ -0,0 +1,396 @@ +% GENOMICCONSENSUS(7) +% +% + + +What are EviCons? GenomicConsensus? Quiver? Plurality? +====================================================== + +**GenomicConsensus** is the current PacBio consensus and variant calling +suite. It contains a main driver program, `variantCaller`, which +provides two consensus/variant calling algorithms: **Arrow** and +**Quiver**. These algorithms can be run by calling +`variantCaller --algorithm=[arrow|quiver|plurality]` or by going +through the convenience wrapper scripts `quiver` and `arrow`. + +**EviCons** was the previous generation PacBio variant caller (removed +: in software release v1.3.1). + +Separate packages called **ConsensusCore** and **ConsensusCore2** are +C++ libraries where all the computation behind Quiver and Arrow are +done, respectively. This is transparent to the user after installation. + +What is Plurality? +================== + +**Plurality** is a very simple variant calling algorithm: it stacks up +the aligned reads (alignment as produced by BLASR, or alternate mapping +tool), and for each column under a reference base, calls the most +abundant (i.e., the plurality) read base (or bases, or deletion) as the +consensus at that reference position. + +Why is Plurality a weak algorithm? +================================== + +Plurality does not perform any local realignment. This means it is +heavily biased by the alignment produced by the mapper (BLASR, +typically). It also means that it is insensitive at detecting indels. +Consider this example: + + Reference AAAA + ---- + Aligned A-AA + reads AA-A + -AAA + ---- + Plurality AAAA + consensus + +Note here that every read has a deletion and the correct consensus call +would be "AAA", but due to the mapper's freedom in gap-placement at the +single-read level, the plurality sequence is "AAAA"---so the deletion is +missed. Local realignment, which plurality does not do, but which could +be considered as implicit in the Quiver algorithm, essentially pushes +the gaps here to the same column, thus identifying the deletion. While +plurality could be adjusted to use a simple "gap normalizing" +realignment, in practice noncognate extras (spurious non-homopolymer +base calls) in the midst of homopolymer runs pose challenges. + +What is Quiver? +=============== + +**Quiver** is a more sophisticated algorithm that finds the maximum +quasi-likelihood template sequence given PacBio reads of the template. +PacBio reads are modeled using a conditional random field approach that +scores the quasi-likelihood of a read given a template sequence. In +addition to the base sequence of each read, Quiver uses several +additional *QV* covariates that the basecaller provides. Using these +covariates provides additional information about each read, allowing +more accurate consensus calls. + +Quiver does not use the alignment provided by the mapper (BLASR, +typically), except for determining how to group reads together at a +macro level. It implicitly performs its own realignment, so it is highly +sensitive to all variant types, including indels---for example, it +resolves the example above with ease. + +The name **Quiver** reflects a consensus-calling algorithm that is +QV-aware. + +We use the lowercase "quiver" to denote the quiver *tool* in +GenomicConsensus, which applies the Quiver algorithm to mapped reads to +derive sequence consensus and variants. + +Quiver is described in detail in the supplementary material to the [HGAP +paper](http://www.nature.com/nmeth/journal/v10/n6/full/nmeth.2474.html). + +What is Arrow? +============== + +Arrow is a newer model intended to supercede Quiver in the near future. +The key differences from Quiver are that it uses an HMM model instead of +a CRF, it computes true likelihoods, and it uses a smaller set of +covariates. We expect a whitepaper on Arrow to be available soon. + +We use the lowercase "arrow" to denote the arrow *tool*, which applies +the Arrow algorithm to mapped reads to derive sequence consensus and +variants. + +How do I run quiver/arrow? +========================== + +For general instructions on installing and running, see the +[HowTo](./HowTo.rst) document. + +What is the output from quiver/arrow? +===================================== + +There are three output files from the GenomicConsensus tools: + +1. A consensus *FASTA* file containing the consensus sequence +2. A consensus *FASTQ* file containing the consensus sequence with + quality annotations +3. A variants *GFF* file containing a filtered, annotated list of + variants identified + +It is important to note that the variants included in the output +variants GFF file are *filtered* by coverage and quality, so not all +variants that are apparent in comparing the reference to the consensus +FASTA output will correspond to variants in the output variants GFF +file. + +To enable all output files, the following can be run (for example): + + % quiver -j16 aligned_reads.cmp.h5 -r ref.fa \ + -o consensus.fa \ + -o consensus.fq \ + -o variants.gff + +The extension is used to determine the output file format. + +What does it mean that quiver consensus is *de novo*? +===================================================== + +Quiver's consensus is *de novo* in the sense that the reference and the +reference alignment are not used to inform the consensus output. Only +the reads factor into the determination of the consensus. + +The only time the reference sequence is used to make consensus calls +-when the `--noEvidenceConsensusCall` flag is set to `reference` or +`lowercasereference` (the default)- is when there is no effective +coverage in a genomic window, so Quiver has no evidence for computing +consensus. One can set `--noEvidenceConsensusCall=nocall` to avoid using +the reference even in zero coverage regions. + +What is the expected quiver accuracy? +===================================== + +Quiver's expected accuracy is a function of coverage and chemistry. The +C2 chemistry (no longer available), P6-C4 and P4-C2 chemistries provide +the most accuracy. Nominal consensus accuracy levels are as follows: + + -------------------------------------------- + Coverage + Expected consensus accuracy + ------------------+------------+ + C2, P4-C2, P6-C4 | P5-C3 + ----------- -------------------------------- + 10x \> Q30 | \> Q30 + + 20x \> Q40 | \> Q40 + + 40x \> Q50 | \> Q45 + + 60-80x \~ Q60 | \> Q55 + -------------------------------------------- + +The "Q" values referred to are Phred-scaled quality values: + +$$q = -10 \log_{10} p_{error}$$ + +for instance, Q50 corresponds to a p\_error of 0.00001---an accuracy of +99.999%. These accuracy expectations are based on routine validations +performed on multiple bacterial genomes before each chemistry release. + +What is the expected accuracy from arrow +======================================== + +arrow achieves similar accuracy to quiver. Numbers will be published +soon. + +What are the residual errors after applying quiver? +=================================================== + +If there are errors remaining applying Quiver, they will almost +invariably be homopolymer run-length errors (insertions or deletions). + +Does quiver/arrow need to know what sequencing chemistry was used? +================================================================== + +At present, the Quiver model is trained per-chemistry, so it is very +important that Quiver knows the sequencing chemistries used. + +If SMRT Analysis software was used to build the cmp.h5 or BAM input +file, the cmp.h5 will be loaded with information about the sequencing +chemistry used for each SMRT Cell, and GenomicConsensus will +automatically identify the right parameters to use. + +If custom software was used to build the cmp.h5, or an override of +Quiver's autodetection is desired, then the chemistry or model must be +explicity entered. For example: + + % quiver -p P4-C2 ... + % quiver -p P4-C2.AllQVsMergingByChannelModel ... + +Can a mix of chemistries be used in a cmp.h5 file for quiver/arrow? +=================================================================== + +Yes! GenomicConsensus tools automatically see the chemistry *per-SMRT +Cell*, so it can figure out the right parameters for each read and model +them appropriately. + +What chemistries and chemistry mixes are supported? +=================================================== + +For Quiver: all PacBio RS chemistries are supported. Chemistry mixtures +of P6-C4, P4-C2, P5-C3, and C2 are supported. + +For Arrow: the RS chemistry P6-C4, and all PacBio Sequel chemistries are +supported. Mixes of these chemistries are supported. + +What are the QVs that the Quiver model uses? +============================================ + +Quiver uses additional QV tracks provided by the basecaller. These QVs +may be looked at as little breadcrumbs that are left behind by the +basecaller to help identify positions where it was likely that errors of +a given type occurred. Formally, the QVs for a given read are vectors of +the same length as the number of bases called; the QVs used are as +follows: + +> - DeletionQV +> - InsertionQV +> - MergeQV +> - SubstitutionQV +> - DeletionTag + +To find out if your cmp.h5 file is loaded with these QV tracks, run the +command : + + % h5ls -rv aligned_reads.cmp.h5 + +and look for the QV track names in the output. If your cmp.h5 file is +lacking some of these tracks, Quiver will still run, though it will +issue a warning that its performance will be suboptimal. + +Why is quiver/arrow making errors in some region? +================================================= + +The most likely cause for *true* errors made by these tools is that the +coverage in the region was low. If there is 5x coverage over a 1000-base +region, then 10 errors in that region can be expected. + +It is important to understand that the effective coverage available to +quiver/arrow is not the full coverage apparent in plots---the tools +filter out ambiguously mapped reads by default. The remaining coverage +after filtering is called the /effective coverage/. See the next section +for discussion of MapQV. + +If you have verified that there is high effective coverage in the region +in question, it is highly possible---given the high accuracy quiver and +arrow can achieve---that the apparent errors actually reflect true +sequence variants. Inspect the FASTQ output file to ensure that the +region was called at high confidence; if an erroneous sequence variant +is being called at high confidence, please report a bug to us. + +What does Quiver do for genomic regions with no effective coverage? +=================================================================== + +For regions with no effective coverage, no variants are outputted, and +the FASTQ confidence is 0. + +The output in the FASTA and FASTQ consensus sequence tracks is dependent +on the setting of the `--noEvidenceConsensusCall` flag. Assuming the +reference in the window is "ACGT", the options are: + + --------------------------------------------------------- + `--noEvidenceConsensusCall=...` Consensus + output + ---------------------------------------------- ---------- + `nocall` (default in 1.4) NNNN + + `reference` ACGT + + `lowercasereference` (new post 1.4, and the acgt + default) + --------------------------------------------------------- + +What is MapQV and why is it important? +====================================== + +MapQV is a single scalar Phred-scaled QV per aligned read that reflects +the mapper's degree of certainty that the read aligned to *this* part of +the reference and not some other. Unambigously mapped reads will have a +high MapQV (typically 255), while a read that was equally likely to have +come from two parts of the reference would have a MapQV of 3. + +MapQV is pretty important when you want highly accurate variant calls. +Quiver and Plurality both filter out aligned reads with a MapQV below 20 +(by default), so as not to call a variant using data of uncertain +genomic origin. + +This can be problematic if using quiver/arrow to get a consensus +sequence. If the genome of interest contains long (relative to the +library insert size) highly-similar repeats, the effective coverage +(after MapQV filtering) may be reduced in the repeat regions---this is +termed these MapQV dropouts. If the coverage is sufficiently reduced in +these regions, quiver/arrow will not call consensus in these +regions---see +What do quiver/arrow do for genomic regions with no effective coverage?\_. + +If you want to use ambiguously mapped reads in computing a consensus for +a denovo assembly, the MapQV filter can be turned off entirely. In this +case, the consensus for each instance of a genomic repeat will be +calculated using reads that may actually be from other instances of the +repeat, so the exact trustworthiness of the consensus in that region may +be suspect. The next section describes how to disable the MapQV filter. + +How can the MapQV filter be turned off and when should it be? +--------------------------------------------------------------The MapQV +filter can be disabled using the flag `--mapQvThreshold=0` (shorthand: +`-m=0`). If running a quiver/arrow job via SMRT Portal, this can be done +by unchecking the "Use only unambiguously mapped reads" option. Consider +this in de novo assembly projects, but it is not recommended for variant +calling applications. + +How can variant calls made by quiver/arrow be inspected or validated? +===================================================================== + +When in doubt, it is easiest to inspect the region in a tool like SMRT +View, which enables you to view the reads aligned to the region. +Deletions and substitutions should be fairly easy to spot; to view +insertions, right-click on the reference base and select "View +Insertions Before...". + +What are the filtering parameters that quiver/arrow use? +======================================================== + +The available options limit read coverage, filters reads by MapQV, and +filters variants by quality and coverage. + +- The overall read coverage used to call consensus in every window is + 100x by default, but can be changed using `-X=value`. +- The MapQV filter, by default, removes reads with MapQV \< 20. This + is configured using `--mapQvThreshold=value` / `-m=value` +- Variants are only called if the read coverage of the site exceeds + 5x, by default---this is configurable using `-x=value`. Further, + they will not be called if the confidence (Phred-scaled) does not + exceed 40---configurable using `-q=value`. + +What happens when the sample is a mixture, or diploid? +-----------------------------------------------------At present, +quiver/arrow assume a haploid sample, and the behavior of on sample +mixtures or diploid/polyploid samples is *undefined*. The program will +not crash, but the output results are not guaranteed to accord with any +one of the haplotypes in the sample, as opposed to a potential +patchwork. + +Why would I want to *iterate* the mapping+(quiver/arrow) process? +================================================================= + +Some customers using quiver for polishing highly repetitive genomes have +found that if they take the consensus FASTA output of quiver, use it as +a new reference, and then perform mapping and Quiver again to get a new +consensus, they get improved results from the second round of quiver. + +This can be explained by noting that the output of the first round of +quiver is more accurate than the initial draft consensus output by the +assembler, so the second round's mapping to the quiver consensus can be +more sensitive in mapping reads from repetitive regions. This can then +result in improved consensus in those repetitive regions, because the +reads have been assigned more correctly to their true genomic loci. +However there is also a possibility that the potential shifting of reads +around from one rounds' mapping to the next might alter borderline (low +confidence) consensus calls even away from repetitive regions. + +We recommend the (mapping+quiver) iteration for customers polishing +repetitive genomes, and it could also prove useful for resequencing +applications. However we caution that this is very much an *exploratory* +procedure and we make no guarantees about its performance. In +particular, borderline consensus calls can change when the procedure is +iterated, and the procedure is *not* guaranteed to be convergent. + +Is iterating the (mapping+quiver/arrow) process a convergent procedure? +======================================================================= + +We have seen many examples where (mapping+quiver), repeated many times, +is evidently *not* a convergent procedure. For example, a variant call +may be present in iteration n, absent in n+1, and then present again in +n+2. It is possible for subtle changes in mapping to change the set of +reads examined upon inspecting a genomic window, and therefore result in +a different consensus sequence there. We expect this to be the case +primarily for "borderline" (low confidence) base calls. + +SEE ALSO +======== + +**variantCaller**(1) diff --git a/debian/man/pbgff.5.md b/debian/man/pbgff.5.md new file mode 100644 index 0000000..efcb624 --- /dev/null +++ b/debian/man/pbgff.5.md @@ -0,0 +1,164 @@ +% PBGFF(5) 2.1 | Bioinformatics formats +% Pacific Biosciences <[email protected]> +% August 2014 + +NAME +==== + +**pbgff** - Pacific Biosciences extended GFFv3 file format + +DESCRIPTION +=========== + +As of this version, `variants.gff` is our primary variant call file +format. The `variants.gff` file is based on the [GFFv3 +standard](http://www.sequenceontology.org/gff3.shtml). The GFFv3 +standard describes a tab-delimited plain-text file meta-format for +describing genomic "features." Each gff file consists of some initial +"header" lines supplying metadata, and then a number of "feature" lines +providing information about each identified variant. + +The GFF Coordinate System +------------------------- + +All coordinates in GFF files are 1-based, and all intervals `start, end` +are understood as including both endpoints. + +Headers +------- + +The `variants.gff` file begins with a block of metadata headers, which +looks like the following: + + ##gff-version 3 + ##pacbio-variant-version 2.1 + ##date Tue Feb 28 17:44:18 2012 + ##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.12 + ##source GenomicConsensus v0.1.0 + ##source-commandline callVariants.py --algorithm=plurality aligned_reads.cmp.h5 -r spinach.fasta -o variants.gff + ##source-alignment-file /home/popeye/data/aligned_reads.cmp.h5 + ##source-reference-file /home/popeye/data/spinach.fasta + ##sequence-region EGFR_Exon_23 1 189 + ##sequence-header EGFR_Exon_24 1 200 + +The `source` and `source-commandline` describe the name and version of +the software generating the file. `pacbio-variant-version` reflects the +specification version that the file contents should adhere to. + +The `sequence-region` headers describe the names and extents of the +reference groups (i.e. reference contigs) that will be refered to in the +file. The names are the same as the full FASTA header. + +`source-alignment-file` and `source-reference-file` record absolute +paths to the primary input files. + +Feature lines +------------- + +After the headers, each line in the file describes a genomic *feature*; +in this file, all the features are potential variants flagged by the +variant caller. The general format of a variant line is a 9-column +(tab-delimited) record, where the first 8 columns correspond to fixed, +predefined entities in the GFF standard, while the 9th column is a +flexible semicolon-delimited list of mappings `key=value`. + +The 8 predefined columns are as follows: + + ------- -------- --------------------------------- ------------------- + Column Name Description Example + Number + + 1 seqId The full FASTA header for the `lambda_NEB3011` + reference contig. + + 2 source (unused; always populated with `.` + `.`) + + 3 type the type of variant. One of `substitution` + `insertion`, `deletion`, or + `substitution`. + + 4 start 1-based start coordinate for the 200 + variant. + + 5 end 1-based end coordinate for the 215 + variant. start\<=end always + obtains, regardless of strand. + + 6 score unused; populated with `.` `.` + + 7 strand unused; populated with `.` `.` + + 8 phase unused; populated with `.` `.` + ------- -------- --------------------------------- ------------------- + +The attributes in the 9th (final) column are as follows: + ++-----------------+-------------------------------+--------------------+ +| Key | Description | Example value | ++-----------------+-------------------------------+--------------------+ +| `coverage` | the read coverage of the | `42` | +| | variant site (not the variant | | +| | itself) | | ++-----------------+-------------------------------+--------------------+ +| `confidence` | the phred-scaled probability | `37` | +| | that the variant is real, | | +| | rounded to the nearest | | +| | integer and truncated at 93 | | ++-----------------+-------------------------------+--------------------+ +| `reference` | the reference base or bases | `T`, `.` | +| | for the variant site. May be | | +| | `.` to represent a | | +| | zero-length substring (for | | +| | insertion events) | | ++-----------------+-------------------------------+--------------------+ +| `variantSeq` | the read base or bases | `T` | +| | corresponding to the variant. | : (haploid); | +| | `.` encodes a zer-length | | +| | string, as for a deletion. | `T/C`, `T/.` | +| | | : (heterozygous) | ++-----------------+-------------------------------+--------------------+ +| `frequency` | the read coverage of the | `13` | +| | variant itself; for | : (haploid) | +| | heterozygous variants, the | | +| | frequency of both observed | `15/12` | +| | alleles. This is an optional | : (heterozygous) | +| | field. | | ++-----------------+-------------------------------+--------------------+ + +The attributes may be present in any order. + +The four types of variant we support are as follows. *(Recall that the +field separator is a tab, not a space.)* + +1. Insertion. Examples: + + ref00001 . insertion 8 8 . . . reference=.;variantSeq=G;confidence=22;coverage=18;frequency=10 + ref00001 . insertion 19 19 . . . reference=.;variantSeq=G/.;confidence=22;coverage=18;frequency=7/5 + +> For insertions, start==end, and the insertion event is understood as +> taking place *following* the reference position start. + +2. Deletion. Examples: + + ref00001 . deletion 348 349 . . . reference=G;variantSeq=.;confidence=39;coverage=25;frequency=20 + ref00001 . deletion 441 443 . . . reference=GG;variantSeq=GG/.;confidence=39;coverage=25;frequency=8/8 + +3. Substitution. Examples: + + ref000001 . substitution 100 102 . . . reference=GGG;variantSeq=CCC;confidence=50;coverage=20;frequency=16 + ref000001 . substitution 200 201 . . . reference=G;variantSeq=G/C;confidence=50;coverage=20;frequency=10/6 + +Compression +----------- + +The gff metaformat is verbose, so for practical purposes we will gzip +encode `variants.gff` files as `variants.gff.gz`. Consumers of the +variant file should be able to read it in either form. + +SEE ALSO +======== + +The VCF and BED standards describe variant-call specific file formats. +We can currently translate variants.gff files to these formats, but they +are not the primary output of the variant callers. diff --git a/debian/man/plurality.1 b/debian/man/plurality.1 index e12ab61..9960f00 100644 --- a/debian/man/plurality.1 +++ b/debian/man/plurality.1 @@ -7,7 +7,7 @@ plurality \- simple variant calling algorithm .SH DESCRIPTION See -.BR quiver-faq (7) +.BR genomicconsensus (7) for a more detailed description of \f[B]plurality\f[P]. .SH OPTIONS @@ -17,7 +17,6 @@ for a more detailed description of \f[B]plurality\f[P]. Show summary of options. .SH SEE ALSO -.BR quiver-faq (7) -.BR quiver (1) -.BR variantCaller.py (1) +.BR genomicconsensus (7) +.BR variantCaller (1) .BR blasr (1) diff --git a/debian/man/poa.1 b/debian/man/poa.1 new file mode 100644 index 0000000..3116a2a --- /dev/null +++ b/debian/man/poa.1 @@ -0,0 +1,22 @@ +.TH POA 1 + +.SH NAME + +poa \- consensus caller for Pacific Biosciences data + +.SH DESCRIPTION + +See +.BR genomicconsensus (7) +for a more detailed description of \f[B]poa\f[P]. + +.SH OPTIONS + +.TP +.B \-h, \-\-help +Show summary of options. + +.SH SEE ALSO +.BR genomicconsensus (7) +.BR variantCaller (1) +.BR blasr (1) diff --git a/debian/man/quiver.1 b/debian/man/quiver.1 new file mode 100644 index 0000000..00f0f57 --- /dev/null +++ b/debian/man/quiver.1 @@ -0,0 +1,22 @@ +.TH QUIVER 1 + +.SH NAME + +quiver \- quality-value aware consensus caller for Pacific Biosciences data + +.SH DESCRIPTION + +See +.BR genomicconsensus (7) +for a more detailed description of \f[B]quiver\f[P]. + +.SH OPTIONS + +.TP +.B \-h, \-\-help +Show summary of options. + +.SH SEE ALSO +.BR genomicconsensus (7) +.BR variantCaller (1) +.BR blasr (1) diff --git a/debian/man/variantCaller.1.md b/debian/man/variantCaller.1.md new file mode 100644 index 0000000..c7c4aa2 --- /dev/null +++ b/debian/man/variantCaller.1.md @@ -0,0 +1,254 @@ +% variantCaller(1) +% +% + +NAME +==== + +**variantCaller** - variant-calling algorithms for PacBio sequencing data + +SYNOPSIS +======== + +`variantCaller` is invoked from the command line. For example, a +simple invocation is: + + variantCaller -j8 --algorithm=quiver \ + -r lambdaNEB.fa \ + -o variants.gff \ + aligned_reads.cmp.h5 + +which requests that variant calling proceed, - using 8 worker processes, +- employing the **quiver** algorithm, - taking input from the file +`aligned_reads.cmp.h5`, - using the FASTA file `lambdaNEB.fa` as the +reference, - and writing output to `variants.gff` (see **pbgff**(5)). + +A particularly useful option is `--referenceWindow/-w`: this option +allows the user to direct the tool to perform variant calling +exclusively on a *window* of the reference genome, where the + +OPTIONS +======= + + variantCaller --help + +will provide a help message explaining all available options. + +NOTES +===== + +Input and output +---------------- + +`variantCaller` requires two input files: + +- A file of reference-aligned reads in PacBio's standard cmp.h5 + format; +- A FASTA file that has been processed by ReferenceUploader. + +The tool's output is formatted in the GFF format, as described in (how +to link to other file?). External tools can be used to convert the GFF +file to a VCF or BED file---two other standard interchange formats for +variant calling. + +> **note** +> +> **Input cmp.h5 file requirements** +> +> `variantCaller` requires its input cmp.h5 file to be be sorted. An +> unsorted file can be sorting using the tool `cmpH5Sort.py`. +> +> The **quiver**(1) algorithm in `variantCaller` requires its input +> cmp.h5 file to have the following *pulse features*: - `InsQV`, - +> `SubsQV`, - `DelQV`, - `DelTag`, - `MergeQV`. +> +> The **plurality**(1) algorithm can be run on cmp.h5 files that lack +> these features. + +The input file is the main argument to `variantCaller`, while the +output file is provided as an argument to the `-o` flag. For example, + + variantCaller aligned_reads.cmp.h5 -r lambda.fa -o variants.gff + +will read input from `aligned_reads.cmp.h5`, using the reference +`lambda.fa`, and send output to the file `variants.gff`. The extension +of the filename provided to the `-o` flag is meaningful, as it +determines the output file format. The file formats presently supported, +by extension, are + +`.gff` +: GFFv3 format + +`.txt` +: a simplified human readable format used primarily by the developers + +If the `-o` flag is not provided, the default behavior is to output to a +`variants.gff` in the current directory. + +> **note** +> +> `variantCaller` does **not** modify its input cmp.h5 file in any +> way. This is in contrast to previous variant callers in use at PacBio, +> which would write a *consensus* dataset to the input cmp.h5 file. + +Available algorithms +-------------------- + +At this time there are two algorithms available for variant calling: +**plurality** and **quiver**. + +**Plurality** is a simple and very fast procedure that merely tallies +the most frequent read base or bases found in alignment with each +reference base, and reports deviations from the reference as potential +variants. + +**Quiver** is a more complex procedure based on algorithms originally +developed for CCS. Quiver leverages the quality values (QVs) provided by +upstream processing tools, which provide insight into whether +insertions/deletions/substitutions were deemed likely at a given read +position. Use of **quiver** requires the `ConsensusCore` and `ConsensusCore2` +libraries as well as trained parameter set, which will be loaded from a +standard location (TBD). Arrow and Quiver can be thought of as +local-realignment procedures (QV-aware in the case of Quiver). + +Both algorithms are expected to converge to *zero* errors (miscalled +variants) as coverage increases; however **quiver** should converge much +faster (i.e., fewer errors at low coverage), and should provide greater +variant detection power at a given error level. + +### Confidence values + +Both *quiver* and *plurality* make a confidence metric available for +every position of the consensus sequence. The confidence should be +interpreted as a phred-transformed posterior probability that the +consensus call is incorrect; i.e. + +$$QV = -10 \log_{10}(p_{err})$$ + +`variantCaller` clips reported QV values at 93---larger values cannot +be encoded in a standard FASTQ file. + +### Chemistry specificity + +The Quiver algorithm parameters are trained per-chemistry. SMRTanalysis +software loads metadata into the cmp.h5 to indicate the chemistry used +per movie. Quiver sees this table and automatically chooses the +appropriate parameter set to use. This selection can be overridden by a +command line flag. + +When multiple chemistries are represented in the reads in a cmp.h5, +Quiver will model each read appropriately using the parameter set for +its chemistry, thus yielding optimal results. + +### Performance Requirements + +`variantCaller` performs variant calling in parallel using multiple +processes. Work splitting and inter-process communication are handled +using the Python `multiprocessing` module. Work can be split among an +arbitrary number of processes (using the `-j` command-line flag), but +for best performance one should use no more worker processes than there +are CPUs in the host computer. + +The running time of the *plurality* algorithm should not exceed the +runtime of the BLASR process that produced the cmp.h5. The running time +of the *quiver* algorithm should not exceed 4x the runtime of BLASR. + +The amount of core memory (RAM) used among all the python processes +launched by a `variantCaller` run should not exceed the size of the +uncompressed input `.cmp.h5` file. + +EXAMPLES +======== + +Basic running instructions +-------------------------- + +Basic usage---using 8 CPUs to compute consensus of mapped reads and +variants relative to a reference---is as follows: + + % quiver -j8 aligned_reads{.cmp.h5, .bam, .fofn, or .xml} \ + > -r reference{.fasta or .xml} -o variants.gff \ + > -o consensus.fasta -o consensus.fastq + +`quiver` is a shortcut for `variantCaller --algorithm=quiver`. +Naturally, to use arrow you could use the `arrow` shortcut or +`variantCaller --algorithm=arrow`. + +in this example we perform haploid consensus and variant calling on the +mapped reads in the `aligned_reads.bam` which was aligned to +`reference.fasta`. The `reference.fasta` is only used for designating +variant calls, not for computing the consensus. The consensus quality +score for every position can be found in the output FASTQ file. + +*Note that 2.3 SMRTanalysis does not support "dataset" input (FOFN or +XML files); those who need this feature should wait for the forthcoming +release of SMRTanalysis 3.0 or build from GitHub sources.* + +Running a large-scale resequencing/polishing job in SMRTanalysis 2.3 +-------------------------------------------------------------------- + +To run a large-scale resequencing job (\>50 megabase genome @ 50x +coverage,nominally), you want to spread the computation load across +multiple nodes in your computing cluster. + +The smrtpipe workflow engine in SMRTanalysis 2.3 provides a convenient +workflow automating this---it will automatically spread the load for +both mapping and quiver jobs among your available cluster nodes. This is +accessible via the SMRTportal UI; the simplest way to set up and run +thse workflows is via tha UI. Nonetheless, we include command-line +instructions for completeness. + +If you have to run the smrtpipe workflow manually from the command line, +a recipe is as folows: + +1. Make sure the reference you will align and compare against is + present in a SMRTportal "reference repository". Even if you don't + want to use SMRTportal, you need to build/import the reference + appropriately, and the simplest way to do that is via SMRTportal. If + you don't have a SMRTportal instance, you can use the + `referenceUploader` command to prepare your reference repository. +2. Prepare an "input.fofn" file listing, one-per-line, each "bax.h5" + file in your input data set. +3. Convert the "input.fofn" to an "input.xml" file that SMRTpipe can + understand: + + \$ fofnToSmrtpipeInput.py input.fofn \> input.xml + +4. Prepare your "params.xml" file. Here is a [params.xml + template](./params-template.xml) you can use; you should just need + to edit the reference path. +5. Activate your SMRTanalysis environment, and invoke smrtpipe: + + \$ source \<SMRT Analysis\>/etc/setup.sh \$ smrtpipe.py --distribute + --params=params.xml xml:input.xml + +6. After successful execution is complete, the results should be + available as data/consensus.fast[aq].gz and data/variants.gff.gz, + etc. + +Please consult the [SMRTpipe reference +manual](http://www.pacb.com/wp-content/uploads/2015/09/SMRT-Pipe-Reference-Guide.pdf) +for further information. + +*Note that resequencing (mapping reads against a reference genome and +then calling consensus and identifying variants) and polishing (mapping +reads against a draft assembly and then taking the consensus output as +the final, polished, assembly) are the same algorithmic operation, the +only effective difference is that the "variants.gff" output is not +biologically meaningful in the polishing case---it just records the +edits that were made to the draft to produce the polished assembly.* + +Running a large-scale quiver/arrow job in SMRTanalysis 3.0+ +----------------------------------------------------------- + +(Forthcoming) + + +SEE ALSO +======== + +**quiver**(1) +**arrow**(1) +**plurality**(1) +**pbgff**(5) +**blasr**(1) diff --git a/debian/patches/manpages.patch b/debian/patches/manpages.patch deleted file mode 100644 index 26c441e..0000000 --- a/debian/patches/manpages.patch +++ /dev/null @@ -1,432 +0,0 @@ -Description: Turn upstream file format documentation into a manpage - These patches change the rst file to be suitable input to rst2man. - I followed the example at - http://docutils.sourceforge.net/sandbox/manpage-writer/rst2man.txt -Author: Afif Elghraoui <[email protected]> -Forwarded: not-needed -Last-Update: 2016-02-11 ---- pbgenomicconsensus.orig/doc/VariantsGffSpecification.rst -+++ pbgenomicconsensus/doc/VariantsGffSpecification.rst -@@ -1,6 +1,20 @@ -+===== -+pbgff -+===== -+ -+---------------------------------------------- -+Pacific Biosciences extended GFFv3 file format -+---------------------------------------------- -+ -+:Author: Pacific Biosciences <[email protected]> -+:Date: August 2014 -+:Version: 2.1 -+:Manual section: 5 -+:Manual group: Bioinformatics formats - --``variants.gff`` File Format (Version 2.1) --============================================ -+ -+DESCRIPTION -+=========== - - As of this version, ``variants.gff`` is our primary variant call file - format. The ``variants.gff`` file is based on the `GFFv3 standard`_. -@@ -40,7 +54,7 @@ - ``pacbio-variant-version`` reflects the specification version that the - file contents should adhere to. - -- The ``sequence-region`` headers describe the names and extents of -+The ``sequence-region`` headers describe the names and extents of - the reference groups (i.e. reference contigs) that will be refered to - in the file. The names are the same as the full FASTA header. - -@@ -162,8 +176,8 @@ - the variant file should be able to read it in either form. - - --Other file formats -------------------- -+SEE ALSO -+======== - - The VCF and BED standards describe variant-call specific file formats. - We can currently translate `variants.gff` files to these formats, but ---- pbgenomicconsensus.orig/doc/HowToQuiver.rst -+++ pbgenomicconsensus/doc/HowToQuiver.rst -@@ -1,51 +1,22 @@ -+====== -+quiver -+====== - --How to install and use Quiver --============================= -+-------------------------------------------------------------- -+genomic consensus caller designed for Pacific Biosciences data -+-------------------------------------------------------------- - --Quiver is bundled in SMRTanalysis version 1.4 and later. The easiest --way to get Quiver is to install the most recent version of SMRTanalysis. -+:Date: February 2016 -+:Manual section: 1 - --If you want to install Quiver as a standalone package from the latest --bleeding-edge code, follow the instructions below. - --*Note: please install this software on an isolated machine that does --not have SMRTanalysis installed. Older versions of SMRTanalysis --pollute the ``PYTHONPATH``, which has the undesirable effect of --overriding ``virtualenv``-installed modules.* -+DESCRIPTION -+=========== - --Background ------------ - **Quiver** is an algorithm for calling highly accurate consensus from - multiple PacBio reads, using a pair-HMM exploiting both the basecalls - and QV metrics to infer the true underlying DNA sequence. - --Quiver is available through the ``quiver`` script from the --``GenomicConsensus`` package. To use Quiver, the following PacBio --software is required. -- --- ``GenomicConsensus``, containing ``quiver`` --- ``ConsensusCore``, a C++ library containing the core computational -- routines for Quiver --- ``pbcore``, a package providing access to PacBio data files -- -- --Required libraries and tools ------------------------------ --To install the PacBio software, the following are required: -- --- Boost >= 1.4.7 (standard C++ libraries) --- SWIG >= 2.0.7 (library wrapper generator) --- Python 2.7.3 --- virtualenv (builds isolated Python environments) -- --If you are within PacBio, these requirements are already installed --within the cluster environment. -- --Otherwise, you will need to install them yourself. The automatic --installation script requires that the ``swig`` executable is in your --UNIX ``$PATH`` and that your boost installation can be found under --``/usr/include`` or ``/usr/local``. -- - - Data file requirements - ---------------------- -@@ -66,45 +37,11 @@ - please upgrade. - - --Installation instructions --------------------------- -- --Step 1: Set up your Python environment --`````````````````````````````````````` --I recommend using a Python *virtualenv* to isolate your sandbox. -- --To set up a new virtualenv, do :: -- -- $ cd; virtualenv -p python2.7 --no-site-packages VE-QUIVER -- --and activate the virtualenv using :: -- -- $ source ~/VE-QUIVER/bin/activate -- -- --Step 2: Install PacBio libraries --```````````````````````````````` --To install the PacBio software, execute :: -- -- $ pip install pbcore -- -- $ git clone https://github.com/PacificBiosciences/ConsensusCore -- $ cd ConsensusCore; python setup.py install --swig=$SWIG --boost=$BOOST -- $ pip install git+https://github.com/PacificBiosciences/GenomicConsensus -- --where you replace ``$SWIG`` with the path to your ``swig`` executable --and ``$BOOST`` with the path to your boost install (the top level --directory). (Note that if SWIG is in your ``$PATH`` and boost is in --``/usr/local`` or ``/usr/include/``, you do not need to specify these --flags on the command line---``setup.py`` will find them). -- -- --Step 3: Run Quiver --`````````````````` --Those who wish to call consensus on a resequencing job can simply use --the ``quiver`` script that has been installed in your --virtualenv (from `GenomicConsensus`). -+EXAMPLES -+======== - -+Running Quiver -+-------------- - For example, :: - - $ quiver -j8 aligned_reads.bam \ -@@ -120,8 +57,8 @@ - QV metrics required for optimal Quiver accuracy. The command will still work, - but it will give a warning that its accuracy will be suboptimal. - --Step 3.5: Run Quiver on Multiple Input Files --```````````````````````````````````````````` -+Run Quiver on Multiple Input Files -+---------------------------------- - Multiple alignment files in a FOFN (File of File Names) can be quivered against - a single reference (`GenomicConsensus` >= 1.1.0). - -@@ -140,22 +77,18 @@ - Quiver can also be used with DataSet XML files. See pbcore for details on - generating new DataSet XML files for your alignment files. - --Step 4: Highly-accurate assembly consensus --`````````````````````````````````````````` -+Highly-accurate assembly consensus -+---------------------------------- - Quiver enables consensus accuracies on genome assemblies at accuracies - approaching or even exceeding Q60 (one error per million bases). If - you use the HGAP assembly protocol in SMRTportal 2.0 or later, Quiver - runs automatically as the final "assembly polishing" step. - - --Resources ----------- --Here is an `FAQ document`_ to address common issues. -- --For a technical summary of some of the details of how Quiver works, I --recommend reading the supplementary material of our 2013 *Nature --Methods* `HGAP paper`_ -- -- --.. _`FAQ document`: https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/QuiverFAQ.rst --.. _`HGAP paper`: http://www.nature.com/nmeth/journal/v10/n6/full/nmeth.2474.html -+SEE ALSO -+======== -+ -+**quiver-faq**\ (7) -+**plurality**\ (1) -+**variantCaller**\ (1) -+**pbgff**\ (5) ---- pbgenomicconsensus.orig/doc/QuiverFAQ.rst -+++ pbgenomicconsensus/doc/QuiverFAQ.rst -@@ -1,5 +1,17 @@ --Quiver FAQ - ========== -+quiver-faq -+========== -+ -+------------------------------------------------------------------------ -+frequently asked questions regarding the quiver genomic consensus caller -+------------------------------------------------------------------------ -+ -+:Date: October 2015 -+:Version: 1.0.0 -+:Manual section: 7 -+ -+QUESTIONS -+========= - - What are EviCons? GenomicConsensus? Quiver? Plurality? - ------------------------------------------------------------ -@@ -67,14 +79,6 @@ - The name **Quiver** reflects a consensus-calling algorithm that is - `QV-aware`. - --How do I run Quiver? ---------------------- -- --For general instructions on installing and running, see the --HowToQuiver_ document. -- -- -- - What does Quiver put in its output files? - ----------------------------------------- - There are three output files from Quiver: -@@ -155,7 +159,7 @@ - - - Does Quiver need to know what sequencing chemistry was used? ------------------------------------------------------------ -+------------------------------------------------------------ - - At present, the Quiver model is trained per-chemistry, so it is very - important that Quiver knows the sequencing chemistries used. -@@ -175,7 +179,7 @@ - - - Can a mix of chemistries be used in a cmp.h5 file for Quiver? ------------------------------------------------------------- -+------------------------------------------------------------- - - Yes! Quiver automatically sees the chemistry *per-SMRT Cell*, so it - can figure out the right parameters for each read and model them -@@ -289,7 +293,7 @@ - - - How can the `MapQV` filter be turned off and when should it be? ---------------------------------------------------------------- -+--------------------------------------------------------------- - The `MapQV` filter can be disabled using the flag - ``--mapQvThreshold=0`` (shorthand: ``-m=0``). If running a - Quiver job via SMRT Portal, this can be done by unchecking the "Use -@@ -299,7 +303,7 @@ - - - How can variant calls made by Quiver be inspected or validated? ---------------------------------------------------------------- -+--------------------------------------------------------------- - When in doubt, it is easiest to inspect the region in a tool like - SMRT View, which enables you to view the reads aligned to the region. - Deletions and substitutions should be fairly easy to spot; to view -@@ -324,7 +328,7 @@ - - - What happens when the sample is a mixture, or diploid? ------------------------------------------------------- -+------------------------------------------------------ - At present, Quiver assumes a haploid sample, and the behavior of - *Quiver* on sample mixtures or diploid/polyploid samples is - *undefined*. The program will not crash, but the output results are -@@ -372,5 +376,10 @@ - calls. - - -+SEE ALSO -+======== - --.. _HowToQuiver: https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/HowToQuiver.rst -+**quiver**\ (1) -+**plurality**\ (1) -+**variantCaller**\ (1) -+**pbgff**\ (5) ---- pbgenomicconsensus.orig/doc/VariantCallerFunctionalSpecification.rst -+++ pbgenomicconsensus/doc/VariantCallerFunctionalSpecification.rst -@@ -1,35 +1,16 @@ -+============= -+variantCaller -+============= -+ -+----------------------------------------------------- -+variant-calling algorithms for PacBio sequencing data -+----------------------------------------------------- -+ -+:Date: February 2016 -+:Manual section: 1 - -- --Variant Caller Functional Specification --======================================= -- --Version 2.2 -- -- --Introduction -------------- -- --This document describes the interface, input/output, and performance --characteristics of ``variantCaller.py``, a variant calling tool --provided by the ``GenomicConsensus`` package. -- -- --Software Overview ------------------- -- --The ``GenomicConsensus`` package provides a command-line tool, --``variantCaller.py``, which provides several variant-calling algorithms for --PacBio sequencing data. ``variantCaller.py`` replaces ``EviCons`` and --``SmrtBayes``, the previous (haploid, diploid---respectively) variant callers --at PacBio. -- -- -- --Functional Requirements ------------------------- -- --Command-line interface --`````````````````````` -+SYNOPSIS -+======== - - ``variantCaller.py`` is invoked from the command line. For example, a simple - invocation is:: -@@ -44,23 +26,24 @@ - - employing the **quiver** algorithm, - - taking input from the file ``aligned_reads.cmp.h5``, - - using the FASTA file ``lambdaNEB.fa`` as the reference, --- and writing output to ``variants.gff``. -+- and writing output to ``variants.gff`` (see **pbgff**\ (5)). - - A particularly useful option is ``--referenceWindow/-w``: this option - allows the user to direct the tool to perform variant calling - exclusively on a *window* of the reference genome, where the - - --Invoking -+OPTIONS -+======= - - :: - - variantCaller.py --help - --will provide a help message explaining all available options; they will be --documented here shortly. -- -+will provide a help message explaining all available options. - -+NOTES -+===== - - Input and output - ```````````````` -@@ -82,7 +65,7 @@ - be sorted. An unsorted file can be sorting using the tool - ``cmpH5Sort.py``. - -- The *quiver* algorithm in ``variantCaller.py`` requires its -+ The **quiver**\ (1) algorithm in ``variantCaller`` requires its - input cmp.h5 file to have the following *pulse features*: - - ``InsQV``, - - ``SubsQV``, -@@ -90,7 +73,7 @@ - - ``DelTag``, - - ``MergeQV``. - -- The *plurality* algorithm can be run on cmp.h5 files that lack -+ The **plurality**\ (1) algorithm can be run on cmp.h5 files that lack - these features. - - The input file is the main argument to ``variantCaller.py``, while the output -@@ -147,19 +130,6 @@ - fewer errors at low coverage), and should provide greater variant detection - power at a given error level. - -- --Software interfaces --``````````````````` --The ``GenomicConsensus`` module has two essential dependencies: -- --1. **pbcore**, the PacBio Python bioinformatics library --2. **ConsensusCore**, a C++ library with SWIG bindings that provides access to -- the same algorithms used in circular consensus sequencing. -- --Both of these modules are easily installed using their ``setup.py`` scripts, --which is the canonical means of installing Python packages. -- -- - Confidence values - ----------------- - -@@ -209,3 +179,11 @@ - The amount of core memory (RAM) used among all the python processes launched - by a ``variantCaller.py`` run should not exceed the size of the uncompressed - input ``.cmp.h5`` file. -+ -+SEE ALSO -+======== -+ -+**quiver**\ (1) -+**plurality**\ (1) -+**pbgff**\ (5) -+**blasr**\ (1) diff --git a/debian/patches/series b/debian/patches/series index 94817e0..0b14f63 100644 --- a/debian/patches/series +++ b/debian/patches/series @@ -1,3 +1,2 @@ spelling.patch -manpages.patch verbose-testing.patch diff --git a/debian/pbgenomicconsensus.manpages b/debian/pbgenomicconsensus.manpages index 19f429f..9914f7b 100644 --- a/debian/pbgenomicconsensus.manpages +++ b/debian/pbgenomicconsensus.manpages @@ -1 +1,3 @@ -debian/man/* +debian/man/*.1 +debian/man/*.5 +debian/man/*.7 diff --git a/debian/rules b/debian/rules index 5677229..a89ca39 100755 --- a/debian/rules +++ b/debian/rules @@ -10,19 +10,47 @@ export PYBUILD_NAME := $(DEB_SOURCE) BINDIR=$(CURDIR)/debian/$(DEB_SOURCE)/usr/bin/ MANDIR=$(CURDIR)/debian/$(DEB_SOURCE)/usr/share/man/ +manpages = \ +pbgff.5.md \ +variantCaller.1.md \ +genomicconsensus.7.md \ + +%: %.md + pandoc -s --to man $< -o $@ + + %: dh $@ --with python2 --buildsystem=pybuild override_dh_auto_test: PYTHONPATH=$(CURDIR) PATH=$(CURDIR)/bin:$$PATH $(MAKE) tests +override_dh_auto_build: docs + dh_auto_build + +.PHONY: docs +docs: $(addprefix $(CURDIR)/debian/man/, $(manpages:.md=)) + +.PHONY: clean-docs +clean-docs: + $(RM) $(addprefix $(CURDIR)/debian/man/, $(manpages:.md=)) + override_dh_install: dh_install # Place executable programs in the main package mkdir -p $(BINDIR) mv debian/python-$(DEB_SOURCE)/usr/bin/* $(BINDIR) - mkdir -p $(MANDIR)/man5 $(MANDIR)/man1 $(MANDIR)/man7 - rst2man doc/VariantsGffSpecification.rst > $(MANDIR)/man5/pbgff.5 - rst2man doc/VariantCallerFunctionalSpecification.rst > $(MANDIR)/man1/variantCaller.1 - rst2man doc/HowToQuiver.rst > $(MANDIR)/man1/quiver.1 - rst2man doc/QuiverFAQ.rst > $(MANDIR)/man7/quiver-faq.7 + +override_dh_auto_clean: clean-docs + dh_auto_clean + +# Helper for setting up the documentation as manpages +# Usage: VPATH=doc/ debian/rules prepare-doc +prepare-doc: $(manpages) + +pbgff.5.md: VariantsGffSpecification.rst +variantCaller.1.md: VariantCallerFunctionalSpecification.rst +genomicconsensus.7.md: FAQ.rst + +$(manpages): + pandoc -s $< -o $@ -- Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/pbgenomicconsensus.git _______________________________________________ debian-med-commit mailing list [email protected] http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/debian-med-commit
