This is an automated email from the git hooks/post-receive script.

tille pushed a commit to branch master
in repository vsearch.

commit fbd885dbb41daaeb76c8eabc8b38ca4537832d4b
Author: Andreas Tille <ti...@debian.org>
Date:   Fri Sep 16 15:25:33 2016 +0200

    Imported Upstream version 2.0.5
---
 README.md        |   9 +-
 configure.ac     |   2 +-
 man/vsearch.1    | 527 +++++++++++++++++++++++++++++++++++--------------------
 src/fastqops.cc  |  14 +-
 src/subsample.cc |  58 +++++-
 5 files changed, 405 insertions(+), 205 deletions(-)

diff --git a/README.md b/README.md
index b9e539e..070093a 100644
--- a/README.md
+++ b/README.md
@@ -131,14 +131,12 @@ For the accuracy assessment searches in Rfam 11.0 with 
100 replicates of the que
 
 **Clustering:** The clustering commands `--cluster_smallmem` and 
`--cluster_fast` have been implemented. These commands support multiple 
threads. The only difference between `--cluster_smallmem` and `--cluster_fast` 
is that `--cluster_fast` will sort the sequences by length before clustering, 
while `--cluster_smallmem` require the sequences to be in length-sorted order 
unless the `--usersort` option is specified. An additional clustering command 
called `--cluster_size` has been added tha [...]
 
-The speed of clustering with VSEARCH relative to USEARCH depends on how many 
threads are used. Running with a single thread VSEARCH currently seems to be 
2-4 times slower than with USEARCH, depending on parameters. Clustering has 
been parallelized with threads in VSEARCH, but clustering does not seem to be 
parallelized in USEARCH (despite what the name and documentation for 
`--cluster_fast` seems to indicate). Clustering with VSEARCH using 4-8 threads 
is often faster than USEARCH. The sp [...]
+The speed of clustering with VSEARCH relative to USEARCH depends on how many 
threads are used. Running with a single thread VSEARCH currently seems to be 
2-4 times slower than with USEARCH, depending on parameters. Speed also depends 
on sequence length, and vsearch is relatively slower with longer sequences 
compared to usearch.
 
 **Chimera detection:** Chimera detection using the algorithm described by 
Edgar *et al.* (2011) has been implemented in VSEARCH. Both the `--uchime_ref` 
and `--uchime_denovo` commands and all their options are supported.
 
 A preliminary assessment of the accuracy of VSEARCH on chimera detection has 
been performed using the SIMM dataset described in the UCHIME paper. See the 
`eval/chimeval.sh` script and the results in `eval/chimeval.txt` for details. 
On the datasets with 1-5% substitutions, VSEARCH is generally on par with the 
original UCHIME implementation (version 4.2.40), and a bit more accurate than 
the implementation in USEARCH (version 7.0.1090). On the datasets with 1-5% 
indels, VSEARCH is clearly m [...]
 
-VSEARCH is about 40% faster than USEARCH on *de novo* chimera detection and 
about 30% faster on detection against a reference database. In VSEARCH 
`uchime_ref` is multithreaded, while `uchime_denovo` is not.
-
 **Dereplication and sorting:** The dereplication and sorting commands seems to 
be considerably faster in VSEARCH than in USEARCH.
 
 **Masking:** VSEARCH by default uses an optimized multithreaded 
re-implementation of the well-known DUST algorithm by Tatusov and Lipman 
(source: ftp://ftp.ncbi.nlm.nih.gov/pub/tatusov/dust/version1/src/) to mask 
simple repeats and low-complexity regions in the sequences before searching and 
clustering. USEARCH by default uses an undocumented rapid masking method called 
"fastnucleo" that seems to mask fewer and smaller regions than dust. USEARCH 
may also be run with the DUST masking meth [...]
@@ -285,8 +283,9 @@ Thanks to the following people for patches and other 
suggestions for improvement
 
 ## Citing VSEARCH
 
-No papers about VSEARCH have been published yet, but a manuscript is in 
preparation.
-For now, please cite the [VSEARCH GitHub 
repository](https://github.com/torognes/vsearch).
+The vsearch manuscript is now published in PeerJ Preprints: 
+
+Rognes T, Flouri T, Nichols B, Quince C, Mahé F. (2016) VSEARCH: a versatile 
open source tool for metagenomics. PeerJ Preprints 4:e2409v1 
https://doi.org/10.7287/peerj.preprints.2409v1
 
 
 ## Test datasets
diff --git a/configure.ac b/configure.ac
index 5364c38..a55a50f 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2,7 +2,7 @@
 # Process this file with autoconf to produce a configure script.
 
 AC_PREREQ([2.63])
-AC_INIT([vsearch], [2.0.3], [torog...@ifi.uio.no])
+AC_INIT([vsearch], [2.0.5], [torog...@ifi.uio.no])
 AM_INIT_AUTOMAKE([subdir-objects])
 AC_LANG([C++])
 AC_CONFIG_SRCDIR([src/vsearch.cc])
diff --git a/man/vsearch.1 b/man/vsearch.1
index 6b0bc15..3c9260f 100644
--- a/man/vsearch.1
+++ b/man/vsearch.1
@@ -1,5 +1,5 @@
 .\" 
============================================================================
-.TH vsearch 1 "August 2, 2016" "version 2.0.3" "USER COMMANDS"
+.TH vsearch 1 "September 1, 2016" "version 2.0.4" "USER COMMANDS"
 .\" 
============================================================================
 .SH NAME
 vsearch \(em chimera detection, clustering, dereplication and
@@ -35,8 +35,8 @@ Dereplication and rereplication:
 \fBvsearch\fR (\-\-derep_fulllength | \-\-derep_prefix)
 \fIfastafile\fR (\-\-output | \-\-uc) \fIoutputfile\fR [\fIoptions\fR]
 .PP
-\fBvsearch\fR (\-\-rereplicate) \fIfastafile\fR
-\-\-output \fIoutputfile\fR [\fIoptions\fR]
+\fBvsearch\fR \-\-rereplicate \fIfastafile\fR \-\-output
+\fIoutputfile\fR [\fIoptions\fR]
 .PP
 .RE
 FASTA/FASTQ file processing:
@@ -122,12 +122,16 @@ computers, thus providing fast and accurate data 
processing.
 .PP
 Comparing nucleotide sequences is at the core of \fBvsearch\fR. To
 speed up comparisons, \fBvsearch\fR implements an extremely fast
-implementation of the Needleman-Wunsch algorithm, making use of the
-Streaming SIMD Extensions (SSE2) of modern x86-64 CPUs. If SSE2
-instructions are not available, \fBvsearch\fR exits with an error
-message. For comparisons involving sequences longer than 5,000
-nucleotides, \fBvsearch\fR uses a slower alignment method with smaller
-memory requirements.
+Needleman-Wunsch algorithm, making use of the Streaming SIMD
+Extensions (SSE2) of modern x86-64 CPUs. If SSE2 instructions are not
+available, \fBvsearch\fR exits with an error message. Memory usage
+increases rapidly with sequence length: for example comparing two
+sequences of length 1 kb requires 8 MB of memory per thread, and
+comparing two 10 kb sequences requires 800 MB of memory per
+thread. For comparisons involving sequences with a length product
+greater than 25,000,000, \fBvsearch\fR uses a slower alignment method
+described by Hirschberg (1975) and Myers and Miller (1988), with
+smaller memory requirements.
 .\" 
----------------------------------------------------------------------------
 .SS Input
 \fBvsearch\fR input is a fasta (or fastq) file containing one or
@@ -205,17 +209,17 @@ that value to 0 to eliminate the wrapping.
 Decompress input using gzip. The option is required only when reading
 from a pipe, otherwise compression is automatically detected.
 .TP
-.B \-\-help | \-\-h
+.B \-\-help | \-h
 Display help text and exit.
 .TP
 .BI \-\-log \0filename
 Write messages to the specified log file. Information written includes
-program version, amount of memory available, number of cores and command
-line options. The start and finish times are also recorded as well as
-the elapsed time. The maximum amount of memory consumed is included.
-The different commands will usually also write some information about
-their results. Both fatal, warning and informational messages are
-written.
+program version, amount of memory available, number of cores and
+command line options, and if need be, informational messages, warnings
+and fatal errors. The start and finish times are also recorded as well
+as the elapsed time and the maximum amount of memory consumed. The
+different commands will usually also write some information about
+their results.
 .TP
 .BI \-\-maxseqlength\~ "positive integer"
 All \fBvsearch\fR operations will discard sequences of length equal or
@@ -231,7 +235,7 @@ Do not truncate sequence labels at first space or tab, use 
the full
 header in output files.
 .TP
 .B \-\-quiet
-Suppress all output to stdout and stdout except for warnings and fatal
+Suppress all output to stdout and stderr except for warnings and fatal
 error messages.
 .TP
 .BI \-\-threads\~ "positive integer"
@@ -243,7 +247,7 @@ uchime_ref, cluster_fast, cluster_size, cluster_smallmem, 
maskfasta,
 allpairs_global, usearch_global.
 Only one thread is used for the other commands.
 .TP
-.B \-\-version | \-\-v
+.B \-\-version | \-v
 Output version information and exit.
 .RE
 .PP
@@ -279,13 +283,14 @@ should be at least 2 times more abundant than their 
chimera. Any
 positive value greater than 1.0 can be used.
 .TP
 .BI \-\-alignwidth\~ "positive integer"
-Width of the 3-way alignments in \-\-uchimealns output. The default
-value is 80. Set to 0 to eliminate wrapping.
+When using \-\-uchimealns, set the width of the 3-way alignments. The
+default value is 80 characters. Set to 0 to eliminate wrapping.
 .TP
 .BI \-\-borderline \0filename
-Output borderline chimeric sequences to \fIfilename\fR, in fasta format.
-Borderline chimeric sequences are sequences that have a high enough
-score but which are not sufficiently diverged from their closest parent.
+Output borderline chimeric sequences to \fIfilename\fR, in fasta
+format.  Borderline chimeric sequences are sequences that have a high
+enough score but which are not sufficiently diverged from their
+closest parent.
 .TP
 .BI \-\-chimeras \0filename
 Output chimeric sequences to \fIfilename\fR, in fasta format. Output
@@ -303,8 +308,8 @@ No vote pseudo-count (parameter \fIn\fR in the chimera 
scoring
 function) (default value is 1.4).
 .TP
 .B \-\-fasta_score
-Add the chimera score to the headers in the fasta output files for chimeras,
-non-chimeras and borderline sequences. A string similar to
+Add the chimera score to the headers in the fasta output files for
+chimeras, non-chimeras and borderline sequences. A string similar to
 ";uchime_denovo=0.1234;" or ";uchime_ref=5.6789;" will be added.
 .TP
 .BI \-\-mindiffs\~ "positive integer"
@@ -332,29 +337,25 @@ When relabelling, keep the old identifier in the header 
after a space.
 .TP
 .B \-\-relabel_md5
 Relabel sequences using the MD5 message digest algorithm applied to
-the sequence to replace the header.
-The sequence is converted to upper case and U is replaced by
-T before the digest is computed.
-The MD5 digest is a cryptographic hash function designed
-to minimize the probability that two
-different inputs will give the same output, even for very similar,
-but non-identical inputs. Still, there will always be a small,
-but non-zero, probability that two different inputs
-will give the same result.
-The MD5 digest generates a 128 bit (16 byte) result that is
-represented by 32 hexadecimal digits (0123456789abcdef).
-Use \-\-sizeout to conserve the abundance annotations.
+each sequence. Former sequence headers are discarded. The sequence is
+converted to upper case and U is replaced by T before the digest is
+computed.  The MD5 digest is a cryptographic hash function designed to
+minimize the probability that two different inputs will give the same
+output, even for very similar, but non-identical inputs. Still, there
+will always be a very small, but non-zero, probability that two
+different inputs will give the same result. The MD5 digest generates a
+128-bit (16-byte) digest that is represented by 16 hexadecimal numbers
+(using 32 symbols among 0123456789abcdef). Use \-\-sizeout to conserve
+the abundance annotations.
 .TP
 .B \-\-relabel_sha1
 Relabel sequences using the SHA1 message digest algorithm applied to
-the sequence to replace the header. It is similar to the
-\-\-relabel_md5 option but uses the SHA1 algorithm instead of the
-MD5 algorithm.
-The SHA1 digest generates a 160 bit (20 byte) result
-that is represented by 40 hexadecimal digits.
-The probability of a collision (two non-identical sequences
-having the same digest) is smaller for the SHA1 algorithm
-than for the MD5 algorithm.
+each sequence. It is similar to the \-\-relabel_md5 option but uses
+the SHA1 algorithm instead of the MD5 algorithm. The SHA1 digest
+generates a 160-bit (20-byte) result that is represented by 20
+hexadecimal numbers (40 symbols). The probability of a collision (two
+non-identical sequences having the same digest) is smaller for the
+SHA1 algorithm than it is for the MD5 algorithm.
 .TP
 .B \-\-self
 When using \-\-uchime_ref, ignore a reference sequence when its label
@@ -366,8 +367,8 @@ When using \-\-uchime_ref, ignore a reference sequence when 
its
 nucleotide sequence is strictly identical with the query sequence.
 .TP
 .B \-\-sizeout
-Add abundance annotations to the output fasta file (with the pattern
-";size=\fIinteger\fR;" to sequence headers) when relabelling.
+When relabelling, add abundance annotations to the output fasta file
+(with the pattern ";size=\fIinteger\fR;" to sequence headers).
 .TP
 .BI \-\-uchime_denovo \0filename
 Detect chimeras present in the fasta-formatted \fIfilename\fR, without
@@ -447,7 +448,8 @@ No vote weight (parameter beta in the scoring function) 
(default value
 is 8.0).
 .TP
 .B \-\-xsize
-Strip abundance information from the headers when writing the output file.
+Strip abundance information from the headers when writing the output
+file.
 .RE
 .PP
 .\" 
----------------------------------------------------------------------------
@@ -541,10 +543,13 @@ pairwise alignment.
 .TP
 .BI \-\-msaout \0filename
 Output a multiple sequence alignment and a consensus sequence for each
-cluster to \fIfilename\fR, in fasta format. The consensus sequence is
-constructed by taking the majority symbol (nucleotide or gap) from
-each column of the alignment. Columns containing a majority of gaps
-are skipped, except for terminal gaps.
+cluster to \fIfilename\fR, in fasta format. Be warned that vsearch
+computes center star multiple sequence alignments using a fast method
+whose accuracy can decrease significantly when using low pairwise
+identity thresholds. The consensus sequence is constructed by taking
+the majority symbol (nucleotide or gap) from each column of the
+alignment. Columns containing a majority of gaps are skipped, except
+for terminal gaps.
 .TP
 .BI \-\-profile \0filename
 Output a sequence profile to a text file with the frequency of each
@@ -623,7 +628,9 @@ just a decreasing length ordering.
 Strip abundance information from the headers when writing the output
 file.
 .TP
-Most searching options as well as score filtering, gap penalties and masking 
also apply to clustering (see the Searching section):
+Most searching options as well as score filtering, gap penalties and
+masking also apply to clustering (see the Searching section for
+definitions):
 .br
 \-\-alnout, \-\-blast6out, \-\-fastapairs, \-\-matched,
 \-\-notmatched, \-\-maxaccept, \-\-maxreject, \-\-samout, \-\-userout,
@@ -644,12 +651,12 @@ are considered the same).
 Merge sequences with identical prefixes contained in \fIfilename\fR.
 A short sequence identical to an initial segment (prefix) of another
 sequence is considered a replicate of the longer sequence. If a
-sequence is identical to the prefix of two or more longer sequences
-it is undefined which of the longer sequences it will be considered
-a replicate of. This has consequences regarding the total abundance
-reported with \-\-sizeout. When considering whether two sequence
-are identical, the comparison is case insensitive, and T and U
-are considered identical.
+sequence is identical to the prefix of two or more longer sequences,
+it will be clustered with the shortest of them. If they are equally
+long, it will be clustered with the most abundant. Remaining ties are
+solved using sequence headers and sequence input order. Sequence
+comparisons are case insensitive, and T and U are considered
+identical.
 .TP
 .BI \-\-maxuniquesize\~ "positive integer"
 Discard sequences with an abundance value greater than \fIinteger\fR.
@@ -660,9 +667,9 @@ Discard sequences with an abundance value smaller than 
\fIinteger\fR.
 .BI \-\-output \0filename
 Write the dereplicated sequences to \fIfilename\fR, in fasta format
 and sorted by decreasing abundance. Identical sequences receive the
-header of the first sequence of their group. If \-\-sizeout is used, the
-number of occurrences (i.e. abundance) of each sequence is indicated
-at the end of their fasta header using the pattern
+header of the first sequence of their group. If \-\-sizeout is used,
+the number of occurrences (i.e. abundance) of each sequence is
+indicated at the end of their fasta header using the pattern
 ";size=\fIinteger\fR;".
 .TP
 .BI \-\-relabel \0string
@@ -718,7 +725,8 @@ dereplication, the option \-\-uc_allhits has no effect on 
the \-\-uc
 output.
 .TP
 .B \-\-xsize
-Strip abundance information from the headers when writing the output file.
+Strip abundance information from the headers when writing the output
+file.
 .RE
 .PP
 .\" 
----------------------------------------------------------------------------
@@ -740,9 +748,9 @@ complement sequences.
 .PP
 .TP 9
 .B \-\-eeout
-For \-\-fastq_filter or \-\-fastq_mergepairs, include the
-number of expected errors (ee) in the sequence header of FASTQ and FASTA
-files. This option is a synonym of the \-\-fastq_eeout option.
+When using \-\-fastq_filter or \-\-fastq_mergepairs, include the
+number of expected errors (ee) in the sequence header of FASTQ and
+FASTA files. This option is a synonym of the \-\-fastq_eeout option.
 .TP
 .BI \-\-eetabbedout \0filename
 When specified with the \-\-fastq_mergepairs command, write statistics
@@ -755,24 +763,24 @@ of errors are the number of differences in the overlap 
region of the
 merged sequence relative to each of the reads in the pair.
 .TP
 .BI \-\-fastaout \0filename
-Write to the given FASTA-formatted file the sequences that passes the
-filter of the \-\-fastq_filter command, or the merged sequences from
-the \-\-fastq_mergepairs command.
+When using \-\-fastq_filter or \-\-fastq_mergepairs, write to the
+given FASTA-formatted file the sequences passing the filter, or the
+merged sequences.
 .TP
 .BI \-\-fastaout_notmerged_fwd \0filename
-For \-\-fastq_mergepairs, write forward reads not merged to the
+When using \-\-fastq_mergepairs, write forward reads not merged to the
 specified FASTA file.
 .TP
 .BI \-\-fastaout_notmerged_rev \0filename
-For \-\-fastq_mergepairs, write reverse reads not merged to the
+When using \-\-fastq_mergepairs, write reverse reads not merged to the
 specified FASTA file.
 .TP
 .BI \-\-fastaout_discarded \0filename
-Write sequences that does not pass the filter of the \-\-fastq_filter
+Write sequences that do not pass the filter of the \-\-fastq_filter
 command to the given FASTA-formatted file.
 .TP
 .B \-\-fastq_allowmergestagger
-Allow the \-\-fastq_mergepairs command to merge staggered read
+When using \-\-fastq_mergepairs, allow to merge staggered read
 pairs. Staggered pairs are pairs where the 3' end of the reverse read
 has an overhang to the left of the 5' end of the forward read. This
 situation can occur when a very short fragment is sequenced. The 3'
@@ -781,31 +789,32 @@ sequence. The opposite option is the \-\-fastq_nostagger 
option. The
 default is to discard staggered pairs.
 .TP
 .BI \-\-fastq_ascii\~ "positive integer"
-The ASCII character number used as the basis for the FASTQ quality
-score. The default is 33, which is used by the Sanger / Illumina 1.8+
-FASTQ format (phred+33). The value 64 is used by the Solexa, Illumina
-1.3+ and Illumina 1.5+ format (phred+64).
+Define the ASCII character number used as the basis for the FASTQ
+quality score. The default is 33, which is used by the Sanger /
+Illumina 1.8+ FASTQ format (phred+33). The value 64 is used by the
+Solexa, Illumina 1.3+ and Illumina 1.5+ formats (phred+64).
 .TP
 .BI \-\-fastq_asciiout\~ "positive integer"
-The ASCII character number used as the basis for the FASTQ quality
-score when writing FASTQ output files. The default is 33. Applies only
-to \-\-fastq_convert.
+When using \-\-fastq_convert, define the ASCII character number used
+as the basis for the FASTQ quality score when writing FASTQ output
+files. The default is 33.
 .TP
 .BI \-\-fastq_chars \0filename
-Try to automatically detect the type of FASTQ file given (Solexa,
-Illumina 1.3+, Illumina 1.5+ or Illumina 1.8+/Sanger) by analysing the
-range of quality score values used.  Summarize the composition of
-sequence and quality strings contained in the input FASTQ file. For
-each of the four DNA letters, \-\-fastq_chars gives the number of
-occurrences of the letter, its relative frequency and the length of
-the longest run of that letter. For each character present in the
-quality strings, \-\-fastq_chars gives the ASCII value of the
-character, its relative frequency, and the number of times a k-mer of
-that character appears at the end of quality strings. The length of
-the k-mer can be set using \-\-fastq_tail (4 by default). The command
-will suggest values for the \-\-fastq_ascii, \-\-fastq_qmin and
-\-\-fastq_qmax options to be used with the other commands that require
-a FASTQ input file.
+Summarize the composition of sequence and quality strings contained in
+the input FASTQ file. For each of the four DNA letters,
+\-\-fastq_chars gives the number of occurrences of the letter, its
+relative frequency and the length of the longest run of that
+letter. For each character present in the quality strings,
+\-\-fastq_chars gives the ASCII value of the character, its relative
+frequency, and the number of times a \fIk\fR-mer of that character
+appears at the end of quality strings. The length of the \fIk\fR-mer
+can be set using \-\-fastq_tail (4 by default). The command
+\-\-fastq_chars tries to automatically detect the type of FASTQ file
+given (Solexa, Illumina 1.3+, Illumina 1.5+ or Illumina 1.8+/Sanger)
+by analyzing the range of observed quality score values. In case of
+success, \-\-fastq_chars suggests values for the \-\-fastq_ascii (33
+or 64), \-\-fastq_qmin and \-\-fastq_qmax options to be used with the
+other commands that require a FASTQ input file.
 .TP
 .BI \-\-fastq_convert \0filename
 Convert between the different variants of the FASTQ file format. The
@@ -817,26 +826,26 @@ mimimum and maximum output quality scores may be limited 
using the
 specified with the \-\-fastqout option.
 .TP
 .B \-\-fastq_eeout
-For \-\-fastq_filter or \-\-fastq_mergepairs, include the
-number of expected errors (ee) in the sequence header of FASTQ and FASTA
-files. This option is a synonym of the \-\-eeout option.
+When using \-\-fastq_filter or \-\-fastq_mergepairs, include the
+number of expected errors (ee) in the sequence header of FASTQ and
+FASTA files. This option is a synonym of the \-\-eeout option.
 .TP
 .BI \-\-fastq_eestats \0filename
-Analyse a FASTQ file and report statistics on the sequence lengths,
-distribution of quality scores, error probabilities and expected
-accumulated errors. The output is a file of tab-separated values with
-one line for each position. The values included are the 1-based
-position, number of sequences that includes the position, percentage
-of sequences that include the position, followed by columns that
-includes information about the distribution of quality scores in each
-position, error probabilities in each position, and finally the
-expected number of accumulated errors from the beginning of the
-sequence and until the current position.  For each of these
-distributions, the following statistics are included: minimum value,
-lower quartile, median, mean, upper quartile, and maximum value. The
-type of FASTQ file may be specified with \-\-fastq_ascii
-\-\-fastq_qmin and \-\-fastq_qmax. The output is written to the output
-file specified with the \-\-output option.
+Analyze a FASTQ file and report statistics on the distributions of
+quality scores, error probabilities and expected accumulated
+errors. The report, a table of 21 tab-separated columns, is written to
+the file specified with the \-\-output option. The first column
+corresponds to the position in the reads (Pos). The second and third
+columns correspond to the number of reads (Reads) and percentage of
+reads (PctRecs) that include this position. The remaining columns
+include information about the distribution of quality scores in this
+position (Q), error probabilities in this position (Pe), and finally
+the expected number of accumulated errors from the beginning of the
+reads and until the current position (EE). For each of the Q, Pe and
+EE distributions, the following statistics are included: minimum value
+(Min), lower quartile (Low), median (Med), mean (Mean), upper quartile
+(Hi), and maximum value (Max). The type of FASTQ file may be specified
+with \-\-fastq_ascii \-\-fastq_qmin and \-\-fastq_qmax.
 .TP
 .BI \-\-fastq_filter \0filename
 Shorten and/or filter the sequences in the given FASTQ file and output
@@ -851,32 +860,33 @@ options \-\-fastq_stripleft, \-\-fastq_trunclen, and
 \-\-fastq_minlen and \-\-fastq_trunclen. If shortening results in an
 empty sequence, it will be discarded. The sequences are first
 shortened and then filtered based on the remaining bases. If no
-shortening or filtering options are givem, all sequences will be
+shortening or filtering options are given, all sequences will be
 written to the output files, possibly after conversion from FASTQ to
 FASTA format. The \-\-relabel option may be used to relabel the output
 sequences. The \-\-eeout may be used to output the expected number of
 errors in each sequence.
 .TP
 .BI \-\-fastq_maxdiffs\~ "positive integer"
-Specify the maximum number of non-matching nucleotides allowed in the
-overlap region with the \-\-fastq_mergepairs command. The default
-limit is 5.
+When using \-\-fastq_mergepairs, specify the maximum number of
+non-matching nucleotides allowed in the overlap region. That option
+has a strong influence on the merging success rate. The default
+value is 5.
 .TP
 .BI \-\-fastq_maxee\~ real
-With the \-\-fastq_filter and \-\-fastq_mergepairs commands, discard
-sequences with more than the specified number of expected errors.
+When using \-\-fastq_filter or \-\-fastq_mergepairs, discard sequences
+with more than the specified number of expected errors.
 .TP
 .BI \-\-fastq_maxee_rate\~ real
-With the \-\-fastq_filter command, discard sequences with more than
-the specified number of expected errors per base.
+When using \-\-fastq_filter, discard sequences with more than the
+specified number of expected errors per base.
 .TP
 .BI \-\-fastq_maxmergelen\~ "positive integer"
-Specify the maximum length of the merged sequence with the
-\-\-fastq_mergepairs command. By default there is no limit.
+When using \-\-fastq_mergepairs, specify the maximum length of the
+merged sequence. By default there is no limit.
 .TP
 .BI \-\-fastq_maxns\~ "positive integer"
-With the \-\-fastq_filter and \-\-fastq_mergepairs commands, discard
-sequences with more than the specified number of N's.
+When using \-\-fastq_filter or \-\-fastq_mergepairs, discard sequences
+with more than the specified number of N's.
 .TP
 .BI \-\-fastq_mergepairs\0 filename
 Merge paired-end sequence reads into one sequence. The method has some
@@ -907,32 +917,32 @@ respectively. Other relevant options are: \-\-fastq_ascii,
 and \-\-label_suffix.
 .TP
 .BI \-\-fastq_minlen\~ "positive integer"
-With the \-\-fastq_filter and \-\-fastq_mergepairs commands, discard
-sequences with less than the specified number of bases (default 1).
+When using \-\-fastq_filter or \-\-fastq_mergepairs, discard sequences
+with less than the specified number of bases (default 1).
 .TP
 .BI \-\-fastq_minmergelen\~ "positive integer"
-Specifity the minimum length of the merged sequence with the
-\-\-fastq_mergepairs command. The default is 1.
+When using \-\-fastq_mergepairs, specify the minimum length of the
+merged sequence. The default is 1.
 .TP
 .BI \-\-fastq_minovlen\~ "positive integer"
-Specifify the minimum overlap between the merged reads with the
-\-\-fastq_mergepairs command. The default is 10.
+When using \-\-fastq_mergepairs, specify the minimum overlap between
+the merged reads. The default is 10.
 .TP
 .B \-\-fastq_nostagger
-Disallow the \-\-fastq_mergepairs command to merge staggered read
-pairs. This is the default, so this option is ignored. See the
-\-\-fastq_allowmergestagger option for details.
+When using \-\-fastq_mergepairs, forbid the merging of staggered read
+pairs. This is the default behaviour of \-\-fastq_mergepairs. To
+change that behaviour, see the \-\-fastq_allowmergestagger option.
 .TP
 .BI \-\-fastq_qmax\~ "positive integer"
 Specify the maximum quality score accepted when reading FASTQ
-files. The default is 41, which is usual for recent Sanger/Illumina 1.8+
-files.
+files. The default is 41, which is usual for recent Sanger/Illumina
+1.8+ files.
 .TP
 .BI \-\-fastq_qmaxout\~ "positive integer"
-Specify the maximum quality score used when writing FASTQ files. The
-default is 41, which is usual for recent Sanger/Illumina 1.8+
-files. Older formats may use a maximum quality score of 40. Applies
-only to \-\-fastq_convert.
+When using \-\-fastq_convert, specify the maximum quality score used
+when writing FASTQ files. The default is 41, which is usual for recent
+Sanger/Illumina 1.8+ files. Older formats may use a maximum quality
+score of 40.
 .TP
 .BI \-\-fastq_qmin\~ "positive integer"
 Specify the minimum quality score accepted for FASTQ files. The
@@ -940,63 +950,147 @@ default is 0, which is usual for recent Sanger/Illumina 
1.8+
 files. Older formats may use scores between -5 and 2.
 .TP
 .BI \-\-fastq_qminout\~ "positive integer"
-Specify the minimum quality score used when writing FASTQ
-files. The default is 0, which is usual for Sanger/Illumina 1.8+
-files. Older versions of the format may use values between -5 and
-2. Applies only to \-\-fastq_convert.
+When using \-\-fastq_convert, specify the minimum quality score used
+when writing FASTQ files. The default is 0, which is usual for
+Sanger/Illumina 1.8+ files. Older versions of the format may use
+scores between -5 and 2.
 .TP
 .BI \-\-fastq_stats \0filename
-Analyse a FASTQ file and report various statistics on the sequence
-length and quality score distribution. The type of FASTQ file may be
-specified with \-\-fastq_ascii \-\-fastq_qmin and \-\-fastq_qmax. The
-output is written to the log file specified with the \-\-log option.
+Analyze a FASTQ file and report the number of reads it contains. The
+type of FASTQ file may be specified with \-\-fastq_ascii
+\-\-fastq_qmin and \-\-fastq_qmax. That command requires the \-\-log
+option and outputs the following detailed statistics on read length,
+quality score, length vs. quality distributions, and length / quality
+filtering:
+.RS
+.TP
+Read length distribution:
+.RS
+.nr step 1 1
+.IP \n[step]. 4
+L: read length.
+.IP \n+[step].
+N: number of reads.
+.IP \n+[step].
+Pct: fraction of reads with this length.
+.IP \n+[step]:
+AccPct: fraction of reads with this length or longer.
+.RE
+.TP
+Quality score distribution:
+.RS
+.nr step 1 1
+.IP \n[step]. 4
+ASCII: character encoding the quality score.
+.IP \n+[step].
+Q: Phred quality score.
+.IP \n+[step].
+Pe: probability of error associated with the quality score.
+.IP \n+[step].
+N: number of bases with this quality score.
+.IP \n+[step].
+Pct: fraction of bases with this quality score.
+.IP \n+[step]:
+AccPct: fraction of bases with this quality score or higher.
+.RE
+.TP
+Length vs. quality distribution:
+.RS
+.nr step 1 1
+.IP \n[step]. 4
+L: position in reads (starting from position 2).
+.IP \n+[step].
+PctRecs: fraction of reads with at least this length.
+.IP \n+[step].
+AvgQ: average quality score over all reads up to this position.
+.IP \n+[step].
+P(AvgQ): error probability corresponding to AvgQ.
+.IP \n+[step].
+AvgP: average error probability.
+.IP \n+[step]:
+AvgEE: average expected error over all reads up to this position.
+.IP \n+[step]:
+Rate: growth rate of AvgEE between this position and position - 1.
+.IP \n+[step]:
+RatePct: Rate (as explained above) expressed as a percentage.
+.RE
+.TP
+Effect of expected error and length filtering
+.RS
+The first column indicates read lengths (\fIL\fR). The next four
+columns indicate the number of reads that would be retained by the
+\-\-fastq_filter command if the reads were truncated at length \fIL\fR
+(option \-\-fastq_trunclen \fIL\fR) and filtered to have a maximum
+expected error of 1.0, 0.5, 0.25 or 0.1 (option \-\-fastq_maxee
+\fIfloat\fR). The last four columns indicate the fraction of reads
+that would be retained by the \-\-fastq_filter command using the same
+length and maximum expected error parameters.
+.RE
+.TP
+Effect of minimum quality and length filtering
+.RS
+The first column indicates read lengths (\fILen\fR). The next four
+columns indicate the fraction of reads that would be retained by the
+\-\-fastq_filter command if the reads were truncated at length
+\fILen\fR (option \-\-fastq_trunclen \fILen\fR) or at the first
+position with a quality \fIQ\fR below 5, 10, 15 or 20 (option
+\-\-fastq_truncqual \fIQ\fR).
+.RE
+.RE
 .TP
 .BI \-\-fastq_stripleft\~ "positive integer"
-The specified number of bases on the left are deleted when using the
-\-\-fastq_filter command.
+When using \-\-fastq_filter, strip the specified number of bases from
+the beginning of the reads.
 .TP
 .BI \-\-fastq_tail\~ "positive integer"
-For the \-\-fastq_chars command, the length of the sequence tails with
-the identical quality score to count (default 4).
+When using \-\-fastq_chars, count the number of times a series of
+characters of length \fIk\fR appears at the end of quality strings. By
+default, \fIk\fR = 4.
 .TP
 .BI \-\-fastq_trunclen\~ "positive integer"
-Truncate sequences to the specified length when using
-\-\-fastq_filter. Sequences that are shorter will be
-discarded.
+When using \-\-fastq_filter, truncate sequences to the specified
+length. Sequences that are shorter will be discarded.
 .TP
 .BI \-\-fastq_truncqual\~ "positive integer"
-Truncate sequences starting from the first base with the specified
-base quality score value or lower. Empty sequences will be discarded.
+When using \-\-fastq_filter, truncate sequences starting from the
+first base with the specified base quality score value or lower. Empty
+sequences will be discarded.
 .TP
 .BI \-\-fastqout \0filename
-Write to the given FASTQ-formatted file the sequences that passes the
-filter of the \-\-fastq_filter command, or the merged sequences from
-the \-\-fastq_mergepairs command.
+When using \-\-fastq_filter or \-\-fastq_mergepairs, write to the
+given FASTQ-formatted file the sequences passing the filter, or the
+merged sequences.
 .TP
 .BI \-\-fastqout_discarded \0filename
-Write sequences that does not pass the filter of the \-\-fastq_filter
-command to the given FASTQ-formatted file.
+When using \-\-fastq_filter, write sequences that do not pass the
+filter to the given FASTQ-formatted file.
 .TP
 .BI \-\-fastqout_notmerged_fwd \0filename
-For \-\-fastq_mergepairs, write forward reads not merged to the
+When using \-\-fastq_mergepairs, write forward reads not merged to the
 specified FASTQ file.
 .TP
 .BI \-\-fastqout_notmerged_rev \0filename
-For \-\-fastq_mergepairs, write reverse reads not merged to the
+When using \-\-fastq_mergepairs, write reverse reads not merged to the
 specified FASTQ file.
 .TP
 .BI \-\-fastx_revcomp \0filename
 Reverse-complement the sequences in the given FASTA or FASTQ file to a
-files specified with the \-\-fastaout and/or \-\-fastqout options. If
+file specified with the \-\-fastaout and/or \-\-fastqout options. If
 the input file is in FASTA format, the output can not be written back
 to a FASTQ file due to missing base quality scores.
 .TP
 .BI \-\-label_suffix\~ string
-Label to add as a suffix to the header in the output from
-\-\-fastx_revcomp and \-\-fastq_mergepairs.
+When using \-\-fastx_revcomp or \-\-fastq_mergepairs, add the suffix
+\fIstring\fR to sequence headers.
+.TP
+.BI \-\-output \0filename
+When using \-\-fastq_eestats, write tabulated results to
+\fIfilename\fR. See \-\-fastq_eestats's documentation for a complete
+description of the table.
 .TP
 .B \-\-relabel_keep
-When relabelling, keep the old identifier in the header after a space.
+When using \-\-relabel, keep the old identifier in the header after a
+space.
 .TP
 .BI \-\-relabel \0string
 Please see the description of the same option under Chimera detection
@@ -1011,8 +1105,8 @@ Please see the description of the same option under 
Chimera detection
 for details.
 .TP
 .BI \-\-reverse \0filename
-Specify the FASTQ file containing containing the reverse reads for the
-\-\-fastq_mergepairs command.
+When using \-\-fastq_mergepairs, specify the FASTQ file containing
+containing the reverse reads.
 .TP
 .B \-\-xsize
 Strip abundance information from the headers when writing the output file.
@@ -1067,6 +1161,7 @@ results of combined masking options \-\-qmask (or 
\-\-dbmask for
 database sequences) and \-\-hardmask, assuming each input sequence
 contains both lower and uppercase nucleotides:
 .PP
+.ce 10 \# center the table (10 lines)
 .TS
 tab(:);
 c c c
@@ -1080,6 +1175,7 @@ dust:on:masked symbols changed to Ns, rest unchanged
 soft:off:lowercase symbols masked, no case changes
 soft:on:lowercase symbols masked and changed to Ns
 .TE
+.ce 0 \# end of centering
 .PP
 .TP 9
 .BI \-\-fastaout \0filename
@@ -1541,7 +1637,10 @@ length.  Internal or terminal gaps are not taken into 
account.
 .TP
 .B \-\-top_hits_only
 Output only the hits with the highest percentage of identity with the
-query.
+query. That option modifies the output of the options \-\-alnout,
+\-\-samout, \-\-userout, \-\-blast6out, \-\-uc, \-\-fastapairs,
+\-\-matched and \-\-notmatched, but not \-\-dbmatched and
+\-\-dbnotmatched
 .TP
 .BI \-\-uc \0filename
 Output searching results in \fIfilename\fR using a uclust-like
@@ -1613,7 +1712,8 @@ Pseudo-randomly shuffle the order of sequences contained 
in
 Output only the top \fIinteger\fR sequences.
 .TP
 .B \-\-xsize
-Strip abundance information from the headers when writing the output file.
+Strip abundance information from the headers when writing the output
+file.
 .RE
 .PP
 .\" 
----------------------------------------------------------------------------
@@ -1676,7 +1776,8 @@ Output only the top \fIinteger\fR sequences (i.e. the 
longest or the
 most abundant).
 .TP
 .B \-\-xsize
-Strip abundance information from the headers when writing the output file.
+Strip abundance information from the headers when writing the output
+file.
 .RE
 .PP
 .\" 
----------------------------------------------------------------------------
@@ -1686,21 +1787,35 @@ Subsampling will randomly extract a certain number or a 
certain
 percentage of the sequences in the input file. If the \-\-sizein
 option is in effect, the abundances of the input sequences will be
 taken into account and the sampling will be performed from the input
-sequences as if they had not be dereplicated. The extraction is
+sequences as if they had not been dereplicated. The extraction is
 performed as a random sampling with a uniform distribution among the
-input sequences and is performed without replacement.  The input file
-is specified with \-\-fastx_subsample option, the output file is
-specified with the \-\-fastaout option and the amount of sequences to
-be sampled is specified with the \-\-sample_pct or \-\-sample_size
-options.
+input sequences and is performed without replacement. The input file
+is specified with \-\-fastx_subsample option, the output files are
+specified with the \-\-fastaout and \-\-fastqout options and the
+amount of sequences to be sampled is specified with the \-\-sample_pct
+or \-\-sample_size options. The sequences not sampled may be written
+to files specified with the options \-\-fasta_discarded and
+\-\-fastq_discarded. The \-\-fastq_ascii, \-\-fastq_qmin and
+\-\-fastq_qmax options are also available.
 .PP
 .TP 9
 .BI \-\-fastaout \0filename
 Write the sampled sequences to \fIfilename\fR, in fasta format.
 .TP
+.BI \-\-fastaout_discarded \0filename
+Write the sequences not sampled to \fIfilename\fR, in fasta format.
+.TP
+.BI \-\-fastqout \0filename
+Write the sampled sequences to \fIfilename\fR, in fastq
+format. Requires input in fastq format.
+.TP
+.BI \-\-fastqout_discarded \0filename
+Write the sequences not sampled to \fIfilename\fR, in fastq
+format. Requires input in fastq format.
+.TP
 .BI \-\-fastx_subsample \0filename
 Perform subsampling from the sequences in the specified input file
-that is in FASTA format.
+that is in FASTA or FASTQ format.
 .TP
 .BI \-\-randseed\~ "positive integer"
 Use \fIinteger\fR as a seed for the pseudo-random generator. A given
@@ -1709,20 +1824,22 @@ replicability. Set to 0 to use a pseudo-random seed 
(default
 behavior).
 .TP
 .BI \-\-sample_pct\~ "real"
-Subsample the given percentage of the input sequences.
+Subsample the given percentage of the input sequences. Accepted values
+range from 0.0 to 100.0.
 .TP
 .BI \-\-sample_size\~ "positive integer"
 Extract the given number of sequences.
 .TP
 .B \-\-sizein
-Take the abundance information of the input file into account, otherwise
-the abundance of each sequence is considered to be 1.
+Take the abundance information of the input file into account,
+otherwise the abundance of each sequence is considered to be 1.
 .TP
 .B \-\-sizeout
 Write abundance information to the output file.
 .TP
 .B \-\-xsize
-Strip abundance information from the headers when writing the output file.
+Strip abundance information from the headers when writing the output
+file.
 .RE
 .PP
 .\" 
----------------------------------------------------------------------------
@@ -1998,9 +2115,8 @@ maxuniquesize (dereplication)
 .IP -
 profile (clustering)
 .IP -
-relabel_md5
-.IP -
-relabel_sha1
+relabel_md5 and relabel_sha1 (chimera detection, dereplication, FASTQ
+processing, shuffling, sorting)
 .IP -
 shuffle (shuffling)
 .PP
@@ -2097,16 +2213,27 @@ only sequences with an abundance equal to or greater 
than 2:
 .SH AUTHORS
 Implementation by Torbjørn Rognes and Tomás Flouri, documentation by
 Frédéric Mahé.
+.PP
+.\" 
============================================================================
+.SH CITATION
+.PP
+Rognes T, Flouri T, Nichols B, Quince C, Mahé F. (2016)
+VSEARCH: a versatile open source tool for metagenomics.
+\fIPeerJ Preprints\fR 4:e2409v1
+<https://doi.org/10.7287/peerj.preprints.2409v1>
+.PP
 .\" 
============================================================================
 .SH REPORTING BUGS
 Submit suggestions and bug-reports at
 <https://github.com/torognes/vsearch/issues>, send a pull request on
 <https://github.com/torognes/vsearch>, or compose a friendly or
 curmudgeont e-mail to Torbjørn Rognes <torog...@ifi.uio.no>.
+.PP
 .\" 
============================================================================
 .SH AVAILABILITY
 Source code and binaries are available at
 <https://github.com/torognes/vsearch>.
+.PP
 .\" 
============================================================================
 .SH COPYRIGHT
 Copyright (C) 2014-2016, Torbjørn Rognes, Frédéric Mahé and Tomás Flouri
@@ -2188,6 +2315,7 @@ copyright Jean-Loup Gailly and Mark Adler.
 .PP
 \fBvsearch\fR binaries may include code from the bzip2 library,
 copyright Julian R. Seward.
+.PP
 .\" 
============================================================================
 .SH SEE ALSO
 \fBswipe\fR, an extremely fast pairwise local (Smith-Waterman)
@@ -2197,6 +2325,7 @@ database search tool by Torbjørn Rognes, available at
 \fBswarm\fR, a fast and accurate amplicon clustering method by
 Frédéric Mahé and Torbjørn Rognes, available at
 <https://github.com/torognes/swarm>.
+.PP
 .\" 
============================================================================
 .SH VERSION HISTORY
 New features and important modifications of \fBvsearch\fR (short lived
@@ -2313,7 +2442,8 @@ Fix bug with large datasets. Fix format of help info.
 Fix more bugs with large datasets.
 .TP
 .BR v1.2.0-1.2.19\~ "released July 6th to September 8th, 2015"
-Several new commands and options added. Bugs fixed. Documentation updated.
+Several new commands and options added. Bugs fixed. Documentation
+updated.
 .TP
 .BR v1.3.0\~ "released September 9th, 2015"
 Changed to autotools build system.
@@ -2381,8 +2511,8 @@ word lengths are also set. The minimum word length is 
increased to 7.
 .BR v1.6.0\~ "released October 9th, 2015"
 This version adds the relabeling options (\-\-relabel, \-\-relabel_md5
 and \-\-relabel_sha1) to the shuffle command. It also adds the
-\-\-xsize option to the clustering, dereplication, shuffling and sorting
-commands.
+\-\-xsize option to the clustering, dereplication, shuffling and
+sorting commands.
 .TP
 .BR v1.6.1\~ "released October 14th, 2015"
 Fix bugs and update manual and help text regarding relabelling. Add
@@ -2395,8 +2525,9 @@ Add \-\-relabel_keep option.
 .TP
 .BR v1.8.0\~ "released October 19th, 2015"
 Added \-\-search_exact, \-\-fastx_mask and \-\-fastq_convert commands.
-Changed most commands to read FASTQ input files as well as FASTA files.
-Modified \-\-fastx_revcomp and \-\-fastx_subsample to write FASTQ files.
+Changed most commands to read FASTQ input files as well as FASTA
+files.  Modified \-\-fastx_revcomp and \-\-fastx_subsample to write
+FASTQ files.
 .TP
 .BR v1.8.1\~ "released November 2nd, 2015"
 Fixes for compatibility with QIIME and older OS X versions.
@@ -2420,7 +2551,8 @@ Fixed a bug in the computation of some values with 
\-\-fastq_stats.
 Workaround for missing x86intrin.h with old compilers.
 .TP
 .BR v1.9.4\~ "released December 3rd, 2015"
-Fixed incrementation of counter when relabeling dereplicated sequences.
+Fixed incrementation of counter when relabeling dereplicated
+sequences.
 .TP
 .BR v1.9.5\~ "released December 3rd, 2015"
 Fixed bug resulting in inferior chimera detection performance.
@@ -2492,9 +2624,9 @@ options must be specified if reading compressed input 
from a pipe, but
 are not required when reading from ordinary files. The vsearch header
 that was previously written to stdout is now written to stderr. This
 enables piping of results for further processing. The file name '\-'
-now represent standard input (/dev/stdin) or standard outout
+now represent standard input (/dev/stdin) or standard output
 (/dev/stdout) when reading or writing files, respectively. Code for
-reading FASTA and FASTQ files have been refactored.
+reading FASTA and FASTQ files has been refactored.
 .TP
 .BR v2.0.1\~ "released June 30, 2016"
 Avoid segmentation fault when masking very long sequences.
@@ -2505,6 +2637,13 @@ Avoid warnings when compiling with GCC 6.
 .BR v2.0.3\~ "released August 2, 2016"
 Fixed bad compiler options resulting in Illegal instruction errors
 when running precompiled binaries.
+.TP
+.BR v2.0.4\~ "released September 1, 2016"
+Improved error message for bad FASTQ quality values. Improved manual.
+.TP
+.BR v2.0.5\~ "released September 9, 2016"
+Added options to output discarded sequences from subsampling to
+separate files. Updated manual.
 .RE
 .LP
 .\" 
============================================================================
diff --git a/src/fastqops.cc b/src/fastqops.cc
index 4721e3b..fdf9950 100644
--- a/src/fastqops.cc
+++ b/src/fastqops.cc
@@ -621,7 +621,19 @@ void fastq_stats()
 
           int qual = qc - opt_fastq_ascii;
           if ((qual < opt_fastq_qmin) || (qual > opt_fastq_qmax))
-            fatal("FASTQ quality value out of range");
+            {
+              char * msg;
+              if (asprintf(& msg,
+"FASTQ quality value (%d) out of range (%d-%d).\n"
+"Please adjust the FASTQ quality base character or range with the\n"
+"--fastq_ascii, --fastq_qmin or --fastq_qmax options. For a complete\n"
+"diagnosis with suggested values, please run vsearch --fastq_chars file.",
+                           qual, opt_fastq_qmin, opt_fastq_qmax) > 0)
+                fatal(msg);
+              else
+                fatal("Out of memory");
+              free(msg);
+            }
           
           quality_chars[qc]++;
           if (qc < qmin)
diff --git a/src/subsample.cc b/src/subsample.cc
index cb15f78..118bb7c 100644
--- a/src/subsample.cc
+++ b/src/subsample.cc
@@ -63,7 +63,9 @@
 void subsample()
 {
   FILE * fp_fastaout = 0;
+  FILE * fp_fastaout_discarded = 0;
   FILE * fp_fastqout = 0;
+  FILE * fp_fastqout_discarded = 0;
 
   if (opt_fastaout)
     {
@@ -72,6 +74,13 @@ void subsample()
         fatal("Unable to open fasta output file for writing");
     }
 
+  if (opt_fastaout_discarded)
+    {
+      fp_fastaout_discarded = fopen(opt_fastaout_discarded, "w");
+      if (!fp_fastaout_discarded)
+        fatal("Unable to open fasta output file for writing");
+    }
+
   if (opt_fastqout)
     {
       fp_fastqout = fopen(opt_fastqout, "w");
@@ -79,10 +88,17 @@ void subsample()
         fatal("Unable to open fastq output file for writing");
     }
 
+  if (opt_fastqout_discarded)
+    {
+      fp_fastqout_discarded = fopen(opt_fastqout_discarded, "w");
+      if (!fp_fastqout_discarded)
+        fatal("Unable to open fastq output file for writing");
+    }
+
   db_read(opt_fastx_subsample, 0);
   show_rusage();
 
-  if (fp_fastqout && ! db_is_fastq())
+  if ((fp_fastqout || fp_fastqout_discarded) && ! db_is_fastq())
     fatal("Cannot write FASTQ output with a FASTA input file, lacking quality 
scores");
 
   int dbsequencecount = db_getsequencecount();
@@ -149,10 +165,14 @@ void subsample()
   progress_done();
 
   int samples = 0;
+  int discarded = 0;
   progress_init("Writing output", dbsequencecount);
   for(int i=0; i<dbsequencecount; i++)
     {
-      if (abundance[i]>0)
+      long ab_sub = abundance[i];
+      long ab_discarded = (opt_sizein ? db_getabundance(i) : 1) - ab_sub;
+
+      if (ab_sub > 0)
         {
           samples++;
 
@@ -162,7 +182,7 @@ void subsample()
                                 db_getsequencelen(i),
                                 db_getheader(i),
                                 db_getheaderlen(i),
-                                abundance[i],
+                                ab_sub,
                                 samples);
 
           if (opt_fastqout)
@@ -172,9 +192,33 @@ void subsample()
                                 db_getheader(i),
                                 db_getheaderlen(i),
                                 db_getquality(i),
-                                abundance[i],
+                                ab_sub,
                                 samples);
         }
+
+      if (ab_discarded > 0)
+        {
+          discarded++;
+
+          if (opt_fastaout_discarded)
+            fasta_print_relabel(fp_fastaout_discarded,
+                                db_getsequence(i),
+                                db_getsequencelen(i),
+                                db_getheader(i),
+                                db_getheaderlen(i),
+                                ab_discarded,
+                                discarded);
+
+          if (opt_fastqout_discarded)
+            fastq_print_relabel(fp_fastqout_discarded,
+                                db_getsequence(i),
+                                db_getsequencelen(i),
+                                db_getheader(i),
+                                db_getheaderlen(i),
+                                db_getquality(i),
+                                ab_discarded,
+                                discarded);
+        }
       progress_update(i);
     }
   progress_done();
@@ -190,4 +234,10 @@ void subsample()
 
   if (opt_fastqout)
     fclose(fp_fastqout);
+
+  if (opt_fastaout_discarded)
+    fclose(fp_fastaout_discarded);
+
+  if (opt_fastqout_discarded)
+    fclose(fp_fastqout_discarded);
 }

-- 
Alioth's /usr/local/bin/git-commit-notice on 
/srv/git.debian.org/git/debian-med/vsearch.git

_______________________________________________
debian-med-commit mailing list
debian-med-commit@lists.alioth.debian.org
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/debian-med-commit

Reply via email to