[Samtools-help] Release 1.17

Robert Davies Tue, 21 Feb 2023 06:40:04 -0800

Samtools (and HTSlib and BCFtools) version 1.17 is now available from
GitHub and SourceForge.


https://github.com/samtools/htslib/releases/tag/1.17
https://github.com/samtools/samtools/releases/tag/1.17

https://github.com/samtools/bcftools/releases/tag/1.17
https://sourceforge.net/projects/samtools/


The main changes are listed below:

------------------------------------------------------------------------------
htslib - changes v1.17
------------------------------------------------------------------------------

* A new API for iterating through a BAM record's aux field. (PR#1354,
  addresses #1319.  Thanks to John Marshall)

* Text mode for bgzip. Allows bgzip to compress lines of text with block
  breaks at newlines. (PR#1493, thanks to Mike Lin for the initial version
  PR#1369)
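
To illustrate what "block breaks at newlines" means, here is a minimal Python
sketch. It is not the bgzip implementation: the function name and the 0xFF00
default block limit are assumptions made for the example, and no compression
is performed. It only shows how input could be cut into blocks that end just
after a newline whenever one fits within the limit.

```python
def split_at_newlines(data: bytes, max_block: int = 0xFF00) -> list[bytes]:
    """Cut data into chunks of at most max_block bytes.

    Each chunk ends just after a newline when one is available inside the
    limit; a line longer than max_block falls back to a hard cut.
    (Hypothetical helper for illustration only.)
    """
    blocks = []
    start = 0
    while start < len(data):
        end = min(start + max_block, len(data))
        if end < len(data):
            # Find the last newline inside the candidate block.
            nl = data.rfind(b"\n", start, end)
            if nl != -1:
                end = nl + 1  # break right after the newline
        blocks.append(data[start:end])
        start = end
    return blocks
```

With blocks aligned this way, each block holds whole lines, so a reader can
decompress a single block and process its lines without needing the
neighbouring blocks.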

* Make tabix support CSI indices with large positions.  Unlike SAM and VCF
  files, BED files do not set a maximum reference length which hindered CSI
  support.  This change sets an arbitrary large size of 100G to enable it to
  work. (PR#1506)

* Add a fai_line_length function.  Exposes the internal line-wrap length.
  (PR#1516)

* Check for invalid barcode tags in fastq output. (PR#1518, fixes
  samtools#1728.  Reported by Poshi)

* Warn if reference found in a CRAM file is not contained in the specified
  reference file. (PR#1517 and PR#1521, adds diagnostics for #1515. Reported
  by Wei WeiDeng)

* Add a faidx_seq_len64 function that can return sequence lengths longer than
  INT_MAX.  At the same time limit faidx_seq_len to INT_MAX output.  Also add
  a fai_adjust_region to ensure given ranges do not go beyond the end of the
  requested sequence. (PR#1519)

* Add a bcf_strerror function to give text descriptions of BCF errors.
  (PR#1510)

* Add CRAM SQ/M5 header checking when specifying a fasta file.  This is
  to prevent creating a CRAM that cannot be decoded again. (PR#1522.  In
  response to samtools#1748 though not a direct fix)

* Improve support for very long input lines (> 2Gbyte).  This is mostly
  useful for tabix which does not do much interpretation of its input.
  (PR#1542, a partial fix for #1539)

* Speed up load_ref_portion.  This function has been sped up by about 7x,
  which speeds up low-depth CRAM decoding by about 10%. (PR#1551)

* Expand CRAM API to cope with new samtools cram-size command. (PR#1546)

* Merges neighbouring I and D ops into one op within pileup. This means
  4M1D1D1D3M is reported as 4M3D3M.   Fixing this in sam.c means not only is
  samtools mpileup now looking better, but any tool using the mpileup API
  will be getting consistent results. (PR#1552, fixes the last remaining
  part of samtools#139)
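
The effect of merging neighbouring I and D ops can be sketched in Python as
below. This is only an illustration on CIGAR strings, under the assumption
that adjacent ops of the same type (I with I, D with D) are summed. The real
change lives in the pileup code in sam.c and does not rewrite CIGAR strings
this way; the function name is hypothetical.

```python
import re


def merge_adjacent_indels(cigar: str) -> str:
    """Merge runs of neighbouring I ops, and runs of neighbouring D ops.

    Hypothetical helper for illustration; other op types are left alone.
    """
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    merged = []
    for length, op in ops:
        if merged and op in "ID" and merged[-1][1] == op:
            # Same indel type as the previous op: extend it.
            merged[-1] = (merged[-1][0] + length, op)
        else:
            merged.append((length, op))
    return "".join(f"{n}{op}" for n, op in merged)


print(merge_adjacent_indels("4M1D1D1D3M"))  # 4M3D3M
```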

* Update the API documentation for bgzf_mt as it referred to a previous
  iteration. (PR#1556, fixes #1553.  Reported by Raghavendra Padmanabhan)

Build changes
-------------

* Use POSIX grep in testing as egrep and fgrep are considered obsolete.
  (PR#1509, thanks to David Seifert)

* Switch to building libdeflate with cmake for Cirrus CI. (PR#1511)

* Ensure strings in config_vars.h are escaped correctly. (PR#1530, fixes
  #1527. Reported by Lucas Czech)

* Easier modification of shared library permissions during install. (PR#1532,
  fixes #1525. Reported by StephDC)

* Fix build on ancient compilers.  Added -std=gnu90 to build tests so older C
  compilers will still be happy. (PR#1524, fixes #1523.  Reported by
  Martin Jakt)

* Switch MacOS CI tests to an ARM-based image. (PR#1536)

* Cut down the number of embed_ref=2 tests that get run. (PR#1537)

* Add symbol versions to libhts.so.  This is to aid package developers.
  (PR#1560 addresses #1505, thanks to John Marshall. Reported by
  Stefan Bruens)

* htscodecs now updated to v1.4.0. (PR#1563)

* Cleaned up misleading system error reports in test_bgzf. (PR#1565)

Bug fixes
---------

* VCF. Fix n-squared complexity in sample line with many adjacent tabs
  [fuzz]. (PR#1503)

* Improved bcftools detection and reporting of bgzf decode errors. (PR#1504,
  thanks to Lilian Janin. PR#1529 thanks to Bergur Ragnarsson, fixes #1528.
  PR#1554)

* Prevent crash when the only FASTA entry has no sequence [fuzz]. (PR#1507)

* Fixed typo in sam.h documentation. (PR#1512, thanks to kojix2)

* Fix buffer read-overrun in bam_plp_insertion_mod. (PR#1520)

* Fix hash keys being left behind by bcf_hdr_remove. (PR#1535, fixes #1533.
  Reported by Giulio Genovese in #842)

* Make bcf_hdr_idinfo_exists more robust by checking id value exists.
  (PR#1544, fixes #1538.  Reported by Giulio Genovese)

* CRAM improvements. Fixed crash with multi-threaded CRAM.  Fixed a bug
  in the codec parameter learning for CRAM 3.1 name tokeniser. Fixed CRAM
  compression container substitution matrix generation. (PR#1558, PR#1559
  and PR#1562)

------------------------------------------------------------------------------
samtools - changes v1.17
------------------------------------------------------------------------------

New work and changes:

* New samtools reset subcommand.  Removes alignment information.  Alignment
  location, CIGAR, mate mapping and flags are updated. If the alignment was
  in reverse direction, sequence and its quality values are reversed and
  complemented and the reverse flag is reset.  Supplementary and secondary
  alignment data are discarded. (PR#1767, implements #1682. Requested by dkj)
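
As a rough illustration of the reverse-strand handling described for reset,
the Python sketch below restores a read to its original orientation. It
assumes the usual reading of the text: the sequence is reverse-complemented,
the quality string is reversed, and the reverse flag (0x10) is cleared. The
function name is hypothetical, and the other updates that reset performs
(location, CIGAR, mate fields, remaining flags) are not modelled.

```python
BAM_FREVERSE = 0x10  # SAM flag bit: read is reverse-complemented

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def restore_orientation(seq: str, qual: str, flag: int):
    """Undo reverse-strand storage of a read (illustrative sketch only)."""
    if flag & BAM_FREVERSE:
        seq = seq.translate(_COMPLEMENT)[::-1]  # reverse complement
        qual = qual[::-1]                       # qualities are only reversed
        flag &= ~BAM_FREVERSE                   # clear the reverse flag
    return seq, qual, flag
```

For example, a read stored as `AACG` with qualities `IIH#` and flag 16 comes
back as `CGTT`, `#HII` and flag 0, while a forward-strand read is returned
unchanged.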

* New samtools cram-size subcommand.  It writes out metrics about a CRAM file
  reporting aggregate sizes per block "Content ID" fields, the data-series
  contained within them, and the compression methods used. (PR#1777)

* Added a --sanitize option to fixmate and view.  This performs some sanity
  checks on the state of SAM record fields, fixing up common mistakes made by
  aligners. (PR#1698)

* Permit 1 thread with samtools view.  All other subcommands already allow
  this and it does provide a modest speed increase. (PR#1755, fixes #1743.
  Reported by Goran Vinterhalter)

* Add CRAM_OPT_REQUIRED_FIELDS option for view -c.  This is a big speed up
  for CRAM (maybe 5-fold), but it depends on which filtering options are
  being used. (PR#1776, fixes #1775. Reported by Chang Y)

* New filtering options in samtools depth.  The new --excl-flags option is a
  synonym for -G, with --incl-flags and --require-flags added to match view
  logic. (PR#1718, fixes #1702. Reported by Dario Beraldi)

* Speed up calmd's slow handling of non-position-sorted data by adding
  caching. This uses more memory but is only activated when needed.
  (PR#1723, fixes #1595. Reported by lxwgcool)

* Improve samtools consensus for platforms with instrument specific
  profiles, considerably helping for data with very different indel error
  models and providing base quality recalibration tables. On PacBio HiFi,
  ONT and Ultima Genomics consensus qualities are also redistributed
  within homopolymers and the likelihood of nearby indel errors is raised.
  (PR#1721, PR#1733)

* Consensus --mark-ins option.  This permits the consensus output to include a
  markup indicating the next base is an insertion. This is necessary as we
  need a way of outputting both consensus and also how that consensus marries
  up with the reference coordinates. (PR#1746)

* Make faidx/fqidx output line length default to the input line length.
  (PR#1738, fixes #1734. Reported by John Marshall)
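
The idea of reusing the input line length can be sketched in Python as
follows. A FASTA .fai index line has the tab-separated columns NAME, LENGTH,
OFFSET, LINEBASES and LINEWIDTH, so the input wrap length is available as
LINEBASES. The helper names below are hypothetical and this is not the
samtools code, just a minimal illustration of wrapping output at that length.

```python
def parse_fai_line(line: str) -> dict:
    """Parse one line of a FASTA .fai index into named fields."""
    name, length, offset, linebases, linewidth = line.rstrip("\n").split("\t")[:5]
    return {
        "name": name,
        "length": int(length),
        "offset": int(offset),
        "linebases": int(linebases),   # bases per line in the input file
        "linewidth": int(linewidth),   # bytes per line, including newline
    }


def wrap_sequence(seq: str, line_length: int) -> list[str]:
    """Wrap seq at line_length bases; a non-positive length means no wrap."""
    if line_length <= 0:
        return [seq]
    return [seq[i:i + line_length] for i in range(0, len(seq), line_length)]


entry = parse_fai_line("chr1\t248956422\t6\t60\t61\n")
lines = wrap_sequence("ACGT" * 40, entry["linebases"])
```

Here the 160-base sequence is written as two lines of 60 bases and one of 40,
matching the 60-base wrapping recorded in the index.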

* Speed up optical duplicate checking where data has a lot of duplicates
  compared to non-duplicates. (PR#1779, fixes #1771. Reported by Poshi)

* For collate, use the TMPDIR environment variable when looking for a
  temporary folder. (PR#1782, based on PR#1178 and fixes #1172.  Reported by
  Martin Pollard)

Bug Fixes:

* Fix stats breakage on long deletions when given a reference. (PR#1712,
  fixes #1707. Reported by John Didion)

* In ampliconclip, stop hard clipping from wrongly removing entire reads.
  (PR#1722, fixes #1717. Reported by Kevin Xu)

* Fix bug in ampliconstats where references mentioned in the input file
  headers but not in the bed file would cause it to complain that the SAM
  headers were inconsistent. (PR#1727, fixes #1650. Reported by jPontix)

* Fixed SEGV in samtools collate when no filename given. (PR#1724)

* Changed the default UMI barcode regex in markdup.  The old regex was too
  restrictive.  This version will at least allow the default read name UMI
  as given in the Illumina example documentation. (PR#1737, fixes #1730.
  Reported by yloemie)

* Fix samtools consensus buffer overrun with MD:Z handling. (PR#1745, fixes
  #1744. Reported by trilisser)

* Fix a buffer read-overflow in mpileup and tview on sequences with seq "*".
  (PR#1747)

* Fix view -X command line parsing that was broken in 1.15. (PR#1772, fixes
  #1720.  Reported by Francisco Rodríguez-Algarra and Miguel Machado)

* Stop samtools view -d from reporting meaningless system errors when tag
  validation fails. (PR#1796)

Documentation:

* Add a description of the samtools tview display layout to the man page.
  Documents . vs , and upper vs lowercase. Adds a -s sample example, and
  documents the -w option. (PR#1765, fixes #1759. Reported by
  Lucas Ferreira da Silva)

* Clarify intention of samtools fasta/q in man page and soft vs hard
  clipping. (PR#1794, fixes #1792. Reported by Ryan Lorig-Roach)

* Minor fix to wording of mpileup --rf usage and man page. (PR#1795, fixes
  #1791. Reported by Luka Pavageau)

Non user-visible changes and build improvements:

* Use POSIX grep in testing as egrep and fgrep are considered obsolete.
  (PR#1726, thanks to David Seifert)

* Switch MacOS CI tests to an ARM-based image. (PR#1770)

------------------------------------------------------------------------------
bcftools - changes v1.17
------------------------------------------------------------------------------

Changes affecting the whole of bcftools, or multiple commands:

* The -i/-e filtering expressions

    - Error checks were added to prevent incorrect use of vector arithmetics.
      For example, when evaluating the sum of two vectors A and B, the
      resulting vector could contain nonsense values when the input vectors
      were not of the same length. The fix introduces the following logic:

        - evaluate to C_i = A_i + B_i when length(A)==length(B) and set
          length(C)=length(A)

        - evaluate to C_i = A_i + B_0 when length(B)=1 and set
          length(C)=length(A)

        - evaluate to C_i = A_0 + B_i when length(A)=1 and set
          length(C)=length(B)

        - throw an error when length(A)!=length(B) AND length(A)!=1 AND
          length(B)!=1

    - Arrays in Number=R tags can be now subscripted by alleles found in
      FORMAT/GT. For example,

 FORMAT/AD[GT] > 10        .. require support of more than 10 reads for
                              each allele
 FORMAT/AD[0:GT] > 10      .. same as above, but in the first sample
 sSUM(FORMAT/AD[GT]) > 20  .. require total sample depth bigger than 20
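
The four length cases listed above for vector arithmetic can be written out
as a short Python sketch. This mirrors the stated logic only; it is not the
bcftools source, and the function name is made up for the example.

```python
def add_vectors(a: list, b: list) -> list:
    """Element-wise sum following the length rules described above."""
    if len(a) == len(b):
        # C_i = A_i + B_i, length(C) = length(A)
        return [x + y for x, y in zip(a, b)]
    if len(b) == 1:
        # C_i = A_i + B_0, length(C) = length(A)
        return [x + b[0] for x in a]
    if len(a) == 1:
        # C_i = A_0 + B_i, length(C) = length(B)
        return [a[0] + y for y in b]
    # Lengths differ and neither vector is a scalar: refuse to guess.
    raise ValueError(
        f"incompatible vector lengths: {len(a)} and {len(b)}"
    )
```

So `add_vectors([1, 2, 3], [10])` gives `[11, 12, 13]`, while
`add_vectors([1, 2], [1, 2, 3])` raises an error instead of producing
nonsense values.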

* The commands `consensus -H` and `+split-vep -H`

    - Drop unnecessary leading space in the first header column and newly
      print `#[1]columnName` instead of the previous `# [1]columnName`
      (#1856)

Changes affecting specific commands:

* bcftools +allele-length

    - Fix overflow for indels longer than 512bp and aggregate alleles equal
      or larger than that in the same bin (#1837)

* bcftools annotate

    - Support sample reordering of annotation file (#1785)

    - Restore lost functionality of the --pair-logic option (#1808)

* bcftools call

    - Fix a bug where too many alleles passed to `-C alleles` via `-T` caused
      memory corruption (#1790)

    - Fix a bug where indels constrained with `-C alleles -T` would sometimes
      be missed (#1706)

* bcftools consensus

    - BREAKING CHANGE: the option `-I, --iupac-codes` newly outputs IUPAC
      codes based on FORMAT/GT of all samples. The `-s, --samples` and `-S,
      --samples-file` options can be used to subset samples. In order to
      ignore samples and consider only the REF and ALT columns (the original
      behavior prior to 1.17), run with `-s -` (#1828)

* bcftools convert

    - Make variantkey conversion work for sites without an ALT allele (#1806)

* bcftools csq

    - Fix a bug where a MNV with multiple consequences (e.g. missense +
      stop_gained) would report only the less severe one (#1810)

    - GFF file parsing was made slightly more flexible: ids can now be just
      'XXX' rather than, for example, 'gene:XXX'

    - New gff2gff perl script to fix GFF formatting differences

* bcftools +fill-tags

    - More of the available annotations are now added by the `-t all` option

* bcftools +fixref

    - New INFO/FIXREF annotation

    - New -m swap mode

* bcftools +mendelian

    - The +mendelian plugin has been deprecated and replaced with
      +mendelian2. The function of the plugin is the same but the command
      line options and the output format have changed, and for this reason
      it was introduced as a new plugin.

* bcftools mpileup

    - Most of the annotations generated by mpileup are now optional via the
      `-a, --annotate` option, and several new (mostly experimental)
      annotations have been added.

    - New option `--indels-2.0` for an EXPERIMENTAL indel calling model.
      This model aims to address some known deficiencies of the current
      indel calling algorithm, specifically, it uses diploid reference
      consensus sequence. Note that in the current version it has the
      potential to increase sensitivity but at the cost of decreased
      specificity.

    - Make the FS annotation (Fisher exact test strand bias) functional and
      remove it from the default annotations

* bcftools norm

    - New --multi-overlaps option allows overlapping alleles to be set either
      to the ref allele (the current default) or to a missing allele (#1764
      and #1802)

    - Fixed a bug in `-m -` which did not split missing FORMAT values
      correctly and could lead to empty FORMAT fields such as `::` instead
      of the correct `:.:` (#1818)

    - The `--atomize` option previously would not split complex indels such
      as C>GGG. Newly these will be split into two records C>G and C>CGG
      (#1832)

* bcftools query

    - Fix a rare bug where the printing of SAMPLE field with `query` was
      incorrectly suppressed when the `-e` option contained a sample
      expression while the formatting query did not. See #1783 for details.

* bcftools +setGT

    - Add new `--new-gt X` option (#1800)

    - Add new `--target-gt r:FLOAT` option to randomly select a proportion of
      genotypes (#1850)

    - Fix a bug where `-t ./x` mode was advertised as selecting both phased
      and unphased half-missing genotypes, but was in fact selecting only
      unphased genotypes (#1844)

* bcftools +split-vep

    - New options `-g, --gene-list` and `--gene-list-fields` which allow to
      prioritize consequences from a list of genes, or restrict output to the
      listed genes

    - New `-H, --print-header` option to print the header with `-f`

    - Work around a bug in the LOFTEE VEP plugin used to annotate gnomAD
      VCFs. There the LoF_info subfield contains commas which, in
      general, makes it impossible to parse the VEP subfields. The
      +split-vep plugin can now work with such files, replacing the
      offending commas with slash (/) characters. See also
      https://github.com/Ensembl/ensembl-vep/issues/1351

    - Newly the `-c, --columns` option can be omitted when a subfield is used
      in `-i/-e` filtering expression. Note that `-c` may still have to be
      given when it is not possible to infer the type of the subfield. Note
      that this is an experimental feature.

* bcftools stats

    - The per-sample stats (PSC) would not be computed when `-i/-e` filtering
      options and the `-s -` option were given but the expression did not
      include sample columns (#1835)

* bcftools +tag2tag

    - Revamp of the plugin to allow wider range of tag conversions,
      specifically all combinations from FORMAT/GL,PL,GP to
      FORMAT/GL,PL,GP,GT

* bcftools +trio-dnm2

    - New `-n, --strictly-novel` option to downplay alleles which violate
      Mendelian inheritance but are not novel

    - Allow the `--pn` and `--pns` options to be set separately for SNVs and
      indels and make the indel settings more strict by default

    - Output missing FORMAT/VAF values in non-trio samples, rather than
      random nonsense values

* bcftools +variant-distance

    - New option `-d, --direction` to choose the directionality: forward,
      reverse, nearest (the default) or both (#1829)



--

The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help
