Samtools (and HTSlib and BCFtools) version 1.17 is now available from
GitHub and SourceForge.
https://github.com/samtools/htslib/releases/tag/1.17
https://github.com/samtools/samtools/releases/tag/1.17
https://github.com/samtools/bcftools/releases/tag/1.17
https://sourceforge.net/projects/samtools/
The main changes are listed below:
------------------------------------------------------------------------------
htslib - changes v1.17
------------------------------------------------------------------------------
* A new API for iterating through a BAM record's aux field. (PR#1354,
addresses #1319. Thanks to John Marshall)
* Text mode for bgzip. Allows bgzip to compress lines of text with block
breaks at newlines. (PR#1493, thanks to Mike Lin for the initial version
PR#1369)
* Make tabix support CSI indices with large positions. Unlike SAM and VCF
files, BED files do not set a maximum reference length which hindered CSI
support. This change sets an arbitrary large size of 100G to enable it to
work. (PR#1506)
* Add a fai_line_length function. Exposes the internal line-wrap length.
(PR#1516)
* Check for invalid barcode tags in fastq output. (PR#1518, fixes
samtools#1728. Reported by Poshi)
* Warn if reference found in a CRAM file is not contained in the specified
reference file. (PR#1517 and PR#1521, adds diagnostics for #1515. Reported
by Wei WeiDeng)
* Add a faidx_seq_len64 function that can return sequence lengths longer than
INT_MAX. At the same time limit faidx_seq_len to INT_MAX output. Also add
a fai_adjust_region to ensure given ranges do not go beyond the end of the
requested sequence. (PR#1519)
* Add a bcf_strerror function to give text descriptions of BCF errors.
(PR#1510)
* Add CRAM SQ/M5 header checking when specifying a fasta file. This is
to prevent creating a CRAM that cannot be decoded again. (PR#1522. In
response to samtools#1748 though not a direct fix)
* Improve support for very long input lines (> 2Gbyte). This is mostly
useful for tabix which does not do much interpretation of its input.
(PR#1542, a partial fix for #1539)
* Speed up load_ref_portion. This function has been sped up by about 7x,
which speeds up low-depth CRAM decoding by about 10%. (PR#1551)
* Expand CRAM API to cope with new samtools cram_size command. (PR#1546)
* Merges neighbouring I and D ops into one op within pileup. This means
4M1D1D1D3M is reported as 4M3D3M. Fixing this in sam.c means not only is
samtools mpileup now looking better, but any tool using the mpileup API
will be getting consistent results. (PR#1552, fixes the last remaining
part of samtools#139)
* Update the API documentation for bgzf_mt as it referred to a previous
iteration. (PR#1556, fixes #1553. Reported by Raghavendra Padmanabhan)
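
The pileup change above (4M1D1D1D3M reported as 4M3D3M) can be illustrated
with a small Python sketch. This is not the code in sam.c; it is a minimal
illustration that assumes the merge simply collapses runs of consecutive I
ops, or consecutive D ops, into a single op. The function name
merge_adjacent_indels is hypothetical.

```python
import re

def merge_adjacent_indels(cigar: str) -> str:
    """Collapse consecutive I ops, or consecutive D ops, into one op.

    Illustrative only: other operation types are left untouched, and
    an I directly followed by a D (or vice versa) is not combined.
    """
    merged = []  # list of [length, op] pairs
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if merged and op in "ID" and merged[-1][1] == op:
            # Same indel type as the previous op: extend it.
            merged[-1][0] += int(length)
        else:
            merged.append([int(length), op])
    return "".join(f"{n}{op}" for n, op in merged)

print(merge_adjacent_indels("4M1D1D1D3M"))  # 4M3D3M
```

Tools that call the mpileup API get the merged form from HTSlib itself, so
a helper like this should only be needed when comparing against output
from older releases.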
Build changes
-------------
* Use POSIX grep in testing as egrep and fgrep are considered obsolete.
(PR#1509, thanks to David Seifert)
* Switch to building libdeflate with cmake for Cirrus CI. (PR#1511)
* Ensure strings in config_vars.h are escaped correctly. (PR#1530, fixes
#1527. Reported by Lucas Czech)
* Easier modification of shared library permissions during install. (PR#1532,
fixes #1525. Reported by StephDC)
* Fix build on ancient compilers. Added -std=gnu90 to build tests so older C
compilers will still be happy. (PR#1524, fixes #1523. Reported by
Martin Jakt)
* Switch MacOS CI tests to an ARM-based image. (PR#1536)
* Cut down the number of embed_ref=2 tests that get run. (PR#1537)
* Add symbol versions to libhts.so. This is to aid package developers.
(PR#1560 addresses #1505, thanks to John Marshall. Reported by
Stefan Bruens)
* htscodecs now updated to v1.4.0. (PR#1563)
* Cleaned up misleading system error reports in test_bgzf. (PR#1565)
Bug fixes
---------
* VCF. Fix n-squared complexity in sample line with many adjacent tabs
[fuzz]. (PR#1503)
* Improved bcftools detection and reporting of bgzf decode errors. (PR#1504,
thanks to Lilian Janin. PR#1529 thanks to Bergur Ragnarsson, fixes #1528.
PR#1554)
* Prevent crash when the only FASTA entry has no sequence [fuzz]. (PR#1507)
* Fixed typo in sam.h documentation. (PR#1512, thanks to kojix2)
* Fix buffer read-overrun in bam_plp_insertion_mod. (PR#1520)
* Fix hash keys being left behind by bcf_hdr_remove. (PR#1535, fixes #1533.
Reported by Giulio Genovese in #842)
* Make bcf_hdr_idinfo_exists more robust by checking id value exists.
(PR#1544, fixes #1538. Reported by Giulio Genovese)
* CRAM improvements. Fixed crash with multi-threaded CRAM. Fixed a bug
in the codec parameter learning for CRAM 3.1 name tokeniser. Fixed CRAM
compression container substitution matrix generation. (PR#1558, PR#1559
and PR#1562)
------------------------------------------------------------------------------
samtools - changes v1.17
------------------------------------------------------------------------------
New work and changes:
* New samtools reset subcommand. Removes alignment information. Alignment
location, CIGAR, mate mapping and flags are updated. If the alignment was
in reverse direction, the sequence is reverse-complemented, its quality
values are reversed, and the reverse flag is reset. Supplementary and
secondary alignment data are discarded. (PR#1767, implements #1682.
Requested by dkj)
* New samtools cram-size subcommand. It writes out metrics about a CRAM file
reporting aggregate sizes per block "Content ID" fields, the data-series
contained within them, and the compression methods used. (PR#1777)
* Added a --sanitize option to fixmate and view. This performs some sanity
checks on the state of SAM record fields, fixing up common mistakes made by
aligners. (PR#1698)
* Permit 1 thread with samtools view. All other subcommands already allow
this and it does provide a modest speed increase. (PR#1755, fixes #1743.
Reported by Goran Vinterhalter)
* Add CRAM_OPT_REQUIRED_FIELDS option for view -c. This is a big speed up
for CRAM (maybe 5-fold), but it depends on which filtering options are
being used. (PR#1776, fixes #1775. Reported by Chang Y)
* New filtering options in samtools depth. The new --excl-flags option is a
synonym for -G, with --incl-flags and --require-flags added to match view
logic. (PR#1718, fixes #1702. Reported by Dario Beraldi)
* Speed up calmd's slow handling of non-position-sorted data by adding
caching. This uses more memory but is only activated when needed.
(PR#1723, fixes #1595. Reported by lxwgcool)
* Improve samtools consensus for platforms with instrument specific
profiles, considerably helping for data with very different indel error
models and providing base quality recalibration tables. On PacBio HiFi,
ONT and Ultima Genomics consensus qualities are also redistributed
within homopolymers and the likelihood of nearby indel errors is raised.
(PR#1721, PR#1733)
* Consensus --mark-ins option. This permits the consensus output to include a
markup indicating the next base is an insertion. This is necessary as we
need a way of outputting both consensus and also how that consensus marries
up with the reference coordinates. (PR#1746)
* Make faidx/fqidx output line length default to the input line length.
(PR#1738, fixes #1734. Reported by John Marshall)
* Speed up optical duplicate checking where data has a lot of duplicates
compared to non-duplicates. (PR#1779, fixes #1771. Reported by Poshi)
* For collate, use the TMPDIR environment variable when looking for a
temporary folder. (PR#1782, based on PR#1178 and fixes #1172. Reported by
Martin Pollard)
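
The orientation handling described for samtools reset above can be
sketched in Python. This is only an illustration of one step (restoring
the original read orientation of a reverse-strand record), not the
samtools implementation. The function name is hypothetical, and the
sketch assumes plain ACGTN bases and ignores all other flag bits and
fields that reset also updates.

```python
BAM_FREVERSE = 0x10  # SAM flag bit: read is stored reverse-complemented

# Assumption: only unambiguous bases and N need complementing here.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def restore_original_orientation(flag, seq, qual):
    """Return (flag, seq, qual) with a reverse-strand read turned back.

    Bases are complemented and reversed; qualities are only reversed.
    The reverse flag bit is cleared afterwards.
    """
    if flag & BAM_FREVERSE:
        seq = seq.translate(_COMPLEMENT)[::-1]
        qual = qual[::-1]
        flag &= ~BAM_FREVERSE
    return flag, seq, qual

print(restore_original_orientation(0x10, "AACG", "IIH#"))
# (0, 'CGTT', '#HII')
```

A forward-strand record passes through unchanged, which matches the
description: only alignments in the reverse direction are modified.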
Bug Fixes:
* Fix stats breakage on long deletions when given a reference. (PR#1712,
fixes #1707. Reported by John Didion)
* In ampliconclip, stop hard clipping from wrongly removing entire reads.
(PR#1722, fixes #1717. Reported by Kevin Xu)
* Fix bug in ampliconstats where references mentioned in the input file
headers but not in the bed file would cause it to complain that the SAM
headers were inconsistent. (PR#1727, fixes #1650. Reported by jPontix)
* Fixed SEGV in samtools collate when no filename given. (PR#1724)
* Changed the default UMI barcode regex in markdup. The old regex was too
restrictive. This version will at least allow the default read name UMI
as given in the Illumina example documentation. (PR#1737, fixes #1730.
Reported by yloemie)
* Fix samtools consensus buffer overrun with MD:Z handling. (PR#1745, fixes
#1744. Reported by trilisser)
* Fix a buffer read-overflow in mpileup and tview on sequences with seq "*".
(PR#1747)
* Fix view -X command line parsing that was broken in 1.15. (PR#1772, fixes
#1720. Reported by Francisco Rodríguez-Algarra and Miguel Machado)
* Stop samtools view -d from reporting meaningless system errors when tag
validation fails. (PR#1796)
Documentation:
* Add a description of the samtools tview display layout to the man page.
Documents . vs , and upper vs lowercase. Adds a -s sample example, and
documents the -w option. (PR#1765, fixes #1759. Reported by
Lucas Ferreira da Silva)
* Clarify intention of samtools fasta/q in man page and soft vs hard
clipping. (PR#1794, fixes #1792. Reported by Ryan Lorig-Roach)
* Minor fix to wording of mpileup --rf usage and man page. (PR#1795, fixes
#1791. Reported by Luka Pavageau)
Non user-visible changes and build improvements:
* Use POSIX grep in testing as egrep and fgrep are considered obsolete.
(PR#1726, thanks to David Seifert)
* Switch MacOS CI tests to an ARM-based image. (PR#1770)
------------------------------------------------------------------------------
bcftools - changes v1.17
------------------------------------------------------------------------------
Changes affecting the whole of bcftools, or multiple commands:
* The -i/-e filtering expressions
- Error checks were added to prevent incorrect use of vector arithmetics.
For example, when evaluating the sum of two vectors A and B, the
resulting vector could contain nonsense values when the input vectors
were not of the same length. The fix introduces the following logic:
- evaluate to C_i = A_i + B_i when length(A)==length(B) and set
length(C)=length(A)
- evaluate to C_i = A_i + B_0 when length(B)=1 and set
length(C)=length(A)
- evaluate to C_i = A_0 + B_i when length(A)=1 and set
length(C)=length(B)
- throw an error when length(A)!=length(B) AND length(A)!=1 AND
length(B)!=1
- Arrays in Number=R tags can be now subscripted by alleles found in
FORMAT/GT. For example,
FORMAT/AD[GT] > 10 .. require support of more than 10 reads for
each allele
FORMAT/AD[0:GT] > 10 .. same as above, but in the first sample
sSUM(FORMAT/AD[GT]) > 20 .. require total sample depth bigger than 20
* The commands `consensus -H` and `+split-vep -H`
- Drop unnecessary leading space in the first header column and newly
print `#[1]columnName` instead of the previous `# [1]columnName`
(#1856)
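
The vector arithmetic rules for -i/-e expressions listed above can be
written out as a short Python sketch. This is not bcftools code; it only
restates the four cases for the sum of two vectors, and the function name
add_vectors is hypothetical.

```python
def add_vectors(a, b):
    """Element-wise sum following the four cases described above."""
    if len(a) == len(b):
        # C_i = A_i + B_i, length(C) = length(A)
        return [x + y for x, y in zip(a, b)]
    if len(b) == 1:
        # C_i = A_i + B_0, length(C) = length(A)
        return [x + b[0] for x in a]
    if len(a) == 1:
        # C_i = A_0 + B_i, length(C) = length(B)
        return [a[0] + y for y in b]
    # Lengths differ and neither vector has length 1.
    raise ValueError(
        f"incompatible vector lengths: {len(a)} and {len(b)}"
    )

print(add_vectors([1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
print(add_vectors([1, 2, 3], [10]))          # [11, 12, 13]
print(add_vectors([5], [1, 2]))              # [6, 7]
```

Calling add_vectors([1, 2], [1, 2, 3]) raises ValueError, which
corresponds to the new error case rather than a result containing
nonsense values.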
Changes affecting specific commands:
* bcftools +allele-length
- Fix overflow for indels longer than 512bp and aggregate alleles equal
or larger than that in the same bin (#1837)
* bcftools annotate
- Support sample reordering of annotation file (#1785)
- Restore lost functionality of the --pair-logic option (#1808)
* bcftools call
- Fix a bug where too many alleles passed to `-C alleles` via `-T` caused
memory corruption (#1790)
- Fix a bug where indels constrained with `-C alleles -T` would sometimes
be missed (#1706)
* bcftools consensus
- BREAKING CHANGE: the option `-I, --iupac-codes` newly outputs IUPAC
codes based on FORMAT/GT of all samples. The `-s, --samples` and `-S,
--samples-file` options can be used to subset samples. In order to
ignore samples and consider only the REF and ALT columns (the original
behavior prior to 1.17), run with `-s -` (#1828)
* bcftools convert
- Make variantkey conversion work for sites without an ALT allele (#1806)
* bcftools csq
- Fix a bug where a MNV with multiple consequences (e.g. missense +
stop_gained) would report only the less severe one (#1810)
- GFF file parsing was made slightly more flexible: ids can now be just
'XXX' rather than, for example, 'gene:XXX'
- New gff2gff perl script to fix GFF formatting differences
* bcftools +fill-tags
- More of the available annotations are now added by the `-t all` option
* bcftools +fixref
- New INFO/FIXREF annotation
- New -m swap mode
* bcftools +mendelian
- The +mendelian plugin has been deprecated and replaced with
+mendelian2. The function of the plugin is the same but the command
line options and the output format have changed, and for this reason it
was introduced as a new plugin.
* bcftools mpileup
- Most of the annotations generated by mpileup are now optional via the
`-a, --annotate` option, and several new (mostly experimental)
annotations were added.
- New option `--indels-2.0` for an EXPERIMENTAL indel calling model.
This model aims to address some known deficiencies of the current
indel calling algorithm, specifically, it uses diploid reference
consensus sequence. Note that in the current version it has the
potential to increase sensitivity but at the cost of decreased
specificity.
- Make the FS annotation (Fisher exact test strand bias) functional and
remove it from the default annotations
* bcftools norm
- New --multi-overlaps option allows overlapping alleles to be set either
to the ref allele (the current default) or to a missing allele (#1764
and #1802)
- Fixed a bug in `-m -` which did not split missing FORMAT values
correctly and could lead to empty FORMAT fields such as `::` instead
of the correct `:.:` (#1818)
- The `--atomize` option previously would not split complex indels such
as C>GGG. Newly these will be split into two records C>G and C>CGG
(#1832)
* bcftools query
- Fix a rare bug where the printing of SAMPLE field with `query` was
incorrectly suppressed when the `-e` option contained a sample
expression while the formatting query did not. See #1783 for details.
* bcftools +setGT
- Add new `--new-gt X` option (#1800)
- Add new `--target-gt r:FLOAT` option to randomly select a proportion of
genotypes (#1850)
- Fix a bug where `-t ./x` mode was advertised as selecting both phased
and unphased half-missing genotypes, but was in fact selecting only
unphased genotypes (#1844)
* bcftools +split-vep
- New options `-g, --gene-list` and `--gene-list-fields` which allow
consequences from a list of genes to be prioritized, or output to be
restricted to the listed genes
- New `-H, --print-header` option to print the header with `-f`
- Work around a bug in the LOFTEE VEP plugin used to annotate gnomAD
VCFs. There the LoF_info subfield contains commas which, in
general, makes it impossible to parse the VEP subfields. The
+split-vep plugin can now work with such files, replacing the
offending commas with slash (/) characters. See also
https://github.com/Ensembl/ensembl-vep/issues/1351
- Newly the `-c, --columns` option can be omitted when a subfield is used
in `-i/-e` filtering expression. Note that `-c` may still have to be
given when it is not possible to infer the type of the subfield. Note
that this is an experimental feature.
* bcftools stats
- The per-sample stats (PSC) would not be computed when `-i/-e` filtering
options and the `-s -` option were given but the expression did not
include sample columns (#1835)
* bcftools +tag2tag
- Revamp of the plugin to allow wider range of tag conversions,
specifically all combinations from FORMAT/GL,PL,GP to
FORMAT/GL,PL,GP,GT
* bcftools +trio-dnm2
- New `-n, --strictly-novel` option to downplay alleles which violate
Mendelian inheritance but are not novel
- Allow the `--pn` and `--pns` options to be set separately for SNVs and
indels and make the indel settings more strict by default
- Output missing FORMAT/VAF values in non-trio samples, rather than
random nonsense values
* bcftools +variant-distance
- New option `-d, --direction` to choose the directionality: forward,
reverse, nearest (the default) or both (#1829)
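
The `bcftools norm --atomize` example given above (C>GGG split into C>G
and C>CGG) can be reproduced with a small Python sketch. This is not the
bcftools algorithm: it assumes a single-base REF with a longer ALT, and it
ignores positions, multiple ALT alleles and all other record fields. The
function name is hypothetical.

```python
def atomize_single_base_ref(ref, alt):
    """Split REF>ALT into (ref, alt) pairs for a single-base REF.

    Covers only the shape in the example above: a mismatching first
    base is reported as an SNV, and the remaining ALT bases as an
    insertion anchored on the REF base.
    """
    if len(ref) != 1 or len(alt) < 2:
        raise ValueError("sketch handles only 1-base REF and longer ALT")
    if alt[0] == ref:
        # Already a plain insertion; nothing to split.
        return [(ref, alt)]
    return [(ref, alt[0]), (ref, ref + alt[1:])]

print(atomize_single_base_ref("C", "GGG"))  # [('C', 'G'), ('C', 'CGG')]
```

For the authoritative behaviour on other variant shapes, refer to the
bcftools norm documentation and #1832.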