Samtools (and HTSlib and BCFtools) version 1.12 is now available from
GitHub and SourceForge.
https://sourceforge.net/projects/samtools/
https://github.com/samtools/htslib/releases/tag/1.12
https://github.com/samtools/samtools/releases/tag/1.12
https://github.com/samtools/bcftools/releases/tag/1.12
The main changes are listed below:
------------------------------------------------------------------------------
htslib - changes v1.12
------------------------------------------------------------------------------
Features and Updates
--------------------
* Added experimental CRAM 3.1 and 4.0 support. (#929)
These should not be used for long term data storage as the specification
still needs to be ratified by GA4GH and may be subject to changes in format.
(This is highly likely for 4.0). However it may be tested using:
test/test_view -t ref.fa -C -o version=3.1 in.bam -p out31.cram
For smaller but slower files, try varying the compression profile with an
additional "-o small". Profile choices are fast, normal, small and archive,
and can be applied to all CRAM versions.
* Added a general filtering syntax for alignment records in SAM/BAM/CRAM
readers. (#1181, #1203)
For example, to find chromosome-spanning read pairs with high mapping quality:
'mqual >= 30 && mrname != rname'
To find deletions of significant size: 'cigar =~ "[0-9]{2}D"' or
'rlen - qlen > 10'.
To report duplicates that aren't part of a "proper pair":
'flag.dup && !flag.proper_pair'
More details are in the samtools.1 man page under "FILTER EXPRESSIONS".
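As a rough illustration of what the three example expressions above select,
the following Python sketch re-implements them as plain predicates over a
dictionary. It is not HTSlib's expression evaluator: the dictionary layout and
the function names are hypothetical, and only the flag bit values (0x2 for
proper pair, 0x400 for duplicate) are taken from the SAM specification.

```python
import re

FLAG_PROPER_PAIR = 0x2   # SAM flag bit: read mapped in a proper pair
FLAG_DUP = 0x400         # SAM flag bit: PCR or optical duplicate


def spans_chromosomes(rec):
    """Mirror of: mqual >= 30 && mrname != rname"""
    return rec["mqual"] >= 30 and rec["mrname"] != rec["rname"]


def has_large_deletion(rec):
    """Mirror of: cigar =~ "[0-9]{2}D" (a deletion length of 2+ digits)."""
    return re.search(r"[0-9]{2}D", rec["cigar"]) is not None


def improper_duplicate(rec):
    """Mirror of: flag.dup && !flag.proper_pair"""
    is_dup = bool(rec["flag"] & FLAG_DUP)
    is_proper = bool(rec["flag"] & FLAG_PROPER_PAIR)
    return is_dup and not is_proper


# Hypothetical record used only to exercise the predicates.
example = {"mqual": 42, "rname": "chr1", "mrname": "chr7",
           "cigar": "50M12D38M", "flag": 0x400 | 0x1}
```

With the hypothetical record above, all three predicates return True.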
* The knet networking code has been removed. It only supported the http and
ftp protocols, and a better and safer alternative using libcurl has been
available since release 1.3. If you need access to ftp:// and http://
URLs, HTSlib should be built with libcurl support. (#1200)
* The old htslib/knetfile.h interfaces have been marked as deprecated. Any
code still using them should be updated to use hFILE instead. (#1200)
* Added an introspection API for checking some of the capabilities provided
by HTSlib. (#1170) Thanks also to John Marshall for contributions. (#1222)
- `hfile_list_schemes`: returns the number of schemes found
- `hfile_list_plugins`: returns the number of plugins found
- `hfile_has_plugin`: checks if a specific plugin is available
- `hts_features`: returns a bit mask with all available features
- `hts_test_feature`: test if a feature is available
- `hts_feature_string`: return a string summary of enabled features
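The feature functions above revolve around a bit mask. The Python sketch below
shows the general idea of testing and summarising such a mask. The feature
names and bit positions are invented for illustration and are not HTSlib's
actual constants; consult htslib/hts.h for the real definitions.

```python
# Illustrative feature bits only; names and positions are made up here.
FEATURES = {
    "plugins": 1 << 0,
    "libcurl": 1 << 1,
    "s3": 1 << 2,
}


def has_feature(mask, bit):
    """Return True when every bit of `bit` is set in `mask`."""
    return (mask & bit) == bit


def feature_summary(mask):
    """Build a one-line 'name=yes/no' summary of the known features."""
    return " ".join(
        f"{name}={'yes' if has_feature(mask, bit) else 'no'}"
        for name, bit in FEATURES.items()
    )


# A hypothetical build with plugins and S3 enabled, but no libcurl.
build_mask = FEATURES["plugins"] | FEATURES["s3"]
```

Calling `feature_summary(build_mask)` lists each known feature with a yes/no
marker, which is the same kind of summary a feature string gives.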
* Made performance improvements to the `probaln_glocal` method, which speeds
up mpileup BAQ calculations. (#1188)
- Caching of reused loop variables and removal of loop invariants
- Code reordering to remove instruction latency.
- Other refactoring and tidyups.
* Added a public method for constructing a BAM record from the component
pieces. Thanks to Anders Kaplan. (#1159, #1164)
* Added two public methods, `sam_parse_cigar` and `bam_parse_cigar`, as part
of a small CIGAR API (#1169, #1182). Thanks to Daniel Cameron for input.
(#1147)
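To make the CIGAR entry above more concrete, here is a small Python sketch of
what parsing a CIGAR string involves. It does not call or reproduce
`sam_parse_cigar` or `bam_parse_cigar`; the function names are hypothetical.
The operator order "MIDNSHP=X" and the (length << 4) | op packing follow the
SAM/BAM specification.

```python
import re

CIGAR_OPS = "MIDNSHP=X"  # BAM operator codes 0..8, in specification order


def parse_cigar(text):
    """Parse a SAM CIGAR string into a list of (length, op) pairs."""
    if text == "*":
        return []
    pairs = re.findall(r"(\d+)([MIDNSHP=X])", text)
    # Reject strings with leftover characters that the pattern skipped.
    if "".join(n + op for n, op in pairs) != text:
        raise ValueError(f"invalid CIGAR string: {text!r}")
    return [(int(n), op) for n, op in pairs]


def encode_bam(pairs):
    """Pack each pair as (length << 4) | op_code, as stored in BAM."""
    return [(n << 4) | CIGAR_OPS.index(op) for n, op in pairs]


def reference_length(pairs):
    """Bases consumed on the reference (M, D, N, = and X)."""
    return sum(n for n, op in pairs if op in "MDN=X")


def query_length(pairs):
    """Bases consumed on the query (M, I, S, = and X)."""
    return sum(n for n, op in pairs if op in "MIS=X")
```

For "50M12D38M" this gives a reference length of 100 and a query length of 88.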
* HTSlib, and the included htsfile program, will now recognise the old RAZF
compressed file format. Note that while the format is detected, HTSlib is
unable to read it. It is recommended that RAZF files are uncompressed with
`gunzip` before using them with HTSlib. Thanks to John Marshall (#1244);
and Matthew J. Oldach who reported problems with uncompressing some RAZF
files (samtools/samtools#1387).
* The S3 plugin now has options to force the address style. It will
recognise the addressing_style and host_bucket entries in the
respective aws .credentials and s3cmd .s3cfg files. There is also a
new HTS_S3_ADDRESS_STYLE environment variable. Details are in the
htslib-s3-plugin.7 man file (#1249).
Build changes
-------------
These are compiler, configuration and makefile based changes.
* Added new Makefile targets for the applications that embed HTSlib and want
to run its test suite or clean its generated artefacts. (#1230, #1238)
* The CRAM codecs are now obtained via the htscodecs submodule, hence when
cloning it is now best to use "git clone --recursive". In an existing
clone, you may use "git submodule update --init" to obtain the htscodecs
submodule checkout.
* Updated CI test configuration to recurse HTSlib submodules. (#1359)
* Added Cirrus-CI integration as a replacement for Travis, which was phased
out. (#1175; #1212)
* Updated the Windows image used by Appveyor to 'Visual Studio 2019'. (#1172;
fixed #1166)
* Fixed a buglet in configure.ac, exposed by the release 2.70 of autoconf.
Thanks to John Marshall. (#1198)
* Fixed plugin linking on macOS, to prevent symbol conflict when linking with
a static HTSlib. Thanks to John Marshall. (#1184)
* Fixed a clang++9 error in `cram_io.h`. Thanks to Pjotr Prins. (#1190)
* Introduced $(ALL_CPPFLAGS) to allow for more flexibility in setting the
compiler flags. Thanks to John Marshall. (#1187)
* Added 'fall through' comments to prevent warnings issued by Clang on
intentional fall through case statements, when building with the `-Wextra`
flag. Thanks to John Marshall. (#1163)
* Non-configure builds now define _XOPEN_SOURCE=600 to allow them to work
when the `gcc -std=c99` option is used. Thanks to John Marshall. (#1246)
Bug fixes
---------
* Fixed VCF `#CHROM` header parsing to only separate columns at tab
characters. Thanks to Sam Morris for reporting the issue. (#1237; fixed
samtools/bcftools#1408)
* Fixed a crash reported in `bcf_sr_sort_set`, which expects REF to be
present. (#1204; fixed samtools/bcftools#1361)
* Fixed a bcf synced reader bug when filtering with a region list, and the
first record for a chromosome had the same position as the last record for
the previous chromosome. (#1254; fixed samtools/bcftools#1441)
* Fixed a bug in the overlapping logic of mpileup, dealing with iterating
over CIGAR segments. Thanks to `@wulj2` for the analysis. (#1202; fixed
#1196)
* Fixed a tabix bug that prevented setting the correct number of lines to be
skipped in a region file. Thanks to Jim Robinson for reporting it. (#1189;
fixed #1186)
* Made `bam_itr_next` an alias for `sam_itr_next`, to prevent it from
crashing when working with htsFile pointers. Thanks to Torbjörn Klatt
for reporting it. (#1180; fixed #1179)
* Fixed the `bgzf_idx_flush` assertion made once per outgoing multi-threaded
block, to accommodate situations where a single record spans multiple blocks.
Thanks to `@lacek`. (#1168; fixed samtools/samtools#1328)
* Fixed assumption of pthread_t being a non-structure, as permitted by POSIX.
Thanks also to John Marshall and Anders Kaplan. (#1167, #1153, #1153)
* Fixed the minimum offset of a BAI index bin, to account for unmapped reads.
Thanks to John Marshall for spotting the issue. (#1158; fixed #1142)
* Fixed the CRLF handling in the `sam_parse_worker` method. Thanks to
Anders Kaplan. (#1149; fixed #1148)
* Included unistd.h and errno.h directly in HTSlib files, as opposed to
including them indirectly, via third party code. Thanks to
Andrew Patterson (#1143) and John Marshall (#1145).
------------------------------------------------------------------------------
samtools - changes v1.12
------------------------------------------------------------------------------
* The legacy samtools API (libbam.a, bam.h, sam.h, etc) has not been
actively maintained since 2015. It is deprecated and will be removed
entirely in a future SAMtools release. We recommend coding against the
HTSlib API directly.
* I/O errors and record parsing errors during the reading of SAM/BAM/CRAM
files are now always detected. Thanks to John Marshall. (#1379; fixed #101)
* New make targets have been added: check-all, test-all, distclean-all,
mostlyclean-all, testclean-all, which allow SAMtools installations to
call corresponding Makefile targets from embedded HTSlib installations.
* samtools --version now displays a summary of the compilation details and
available features, including flags, used libraries and enabled plugins
from HTSlib. As an alias, `samtools version` can also be used. (#1371)
* samtools stats now displays the number of supplementary reads in the SN
section. Also, supplementary reads are no longer considered when splitting
read pairs by orientation (inward, outward, other). (#1363)
* samtools stats now counts only the filtered alignments that overlap target
regions, if any are specified. (#1363)
* samtools view now accepts option -N, which takes a file containing read
names of interest. This allows the output of only the reads with names
contained in the given file. Thanks to Daniel Cameron. (#1324)
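The effect of the -N option above can be pictured as a set-membership test on
the read name. The Python sketch below assumes the names file holds one read
name per line, which is an assumption made for this illustration; the function
names and the (qname, flag) record layout are hypothetical as well.

```python
def load_names(lines):
    """Collect read names, one per line, ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}


def select_by_name(records, names):
    """Return only the records whose query name is in the set."""
    return [rec for rec in records if rec[0] in names]


# Hypothetical inputs: the text of a names file and (qname, flag) records.
names = load_names(["read1\n", "\n", "read3\n"])
records = [("read1", 99), ("read2", 147), ("read3", 83), ("read1", 147)]
kept = select_by_name(records, names)
```

Both records named read1 are kept, since the test is on the name alone.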
* samtools view -d option now works without a value associated with the
tag, which allows it to output all the reads with the given tag. (#1339;
fixed #1317)
* samtools view -d and -D options now accept integer and single character
values associated with tags, not just strings. Thanks to `@dariome` and
Keiran Raine for the suggestions. (#1357, #1392)
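The two -d changes above (a tag with no value, and non-string values) can be
sketched as follows in Python. This is only a model of the described
behaviour: the TAG or TAG:VALUE option form is modelled with hypothetical
helper names, and comparing values through their text form is an assumption
made here, not a statement about how samtools compares them internally.

```python
def parse_tag_option(option):
    """Split 'TAG' or 'TAG:VALUE' into (tag, value or None)."""
    tag, sep, value = option.partition(":")
    return tag, (value if sep else None)


def tag_matches(tags, option):
    """Check a record's tag dictionary against a TAG or TAG:VALUE option."""
    tag, wanted = parse_tag_option(option)
    if tag not in tags:
        return False
    if wanted is None:
        return True  # tag present, any value accepted
    # Assumption: integers and single characters match on their text form.
    return str(tags[tag]) == wanted


# Hypothetical aux tags for one record: string, integer and character.
tags = {"RG": "grp1", "NM": 2, "XT": "U"}
```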
* samtools view now works with the filtering expressions introduced by
HTSlib. The filtering expression is passed to the program using the
specific option -e or the global long option --input-fmt-option. E.g.
samtools view -e 'qname =~ "#49$" && mrefid != refid && refid != -1 &&
mrefid != -1' align.bam
looks for records with query-name ending in `#49` that have their mate
aligned in a different chromosome. More details can be found in the
FILTER EXPRESSIONS section of the main man page. (#1346)
* samtools markdup now benefits from an increase in performance in the
situation when a single read has tens or hundreds of thousands of
duplicates. Thanks to `@denriquez` for reporting the issue. (#1345;
fixed #1325)
* The documentation for samtools ampliconstats has been added to the
samtools man page. (#1351)
* A new FASTA/FASTQ sanitizer script (`fasta-sanitize.pl`) was added, which
corrects the invalid characters in the reference names. (#1314) Thanks to
John Marshall for the installation fix. (#1353)
* The CI scripts have been updated to recurse the HTSlib submodules when
cloning HTSlib, to accommodate the CRAM codecs, which now reside in
the htscodecs submodule. (#1359)
* The CI integrations now include Cirrus-CI rather than Travis. (#1335;
#1365)
* Updated the Windows image used by Appveyor to 'Visual Studio 2019'.
(#1333; fixed #1332)
* Fixed a bug in samtools cat, which prevented the command from running in
multi-threaded mode. Thanks to Alex Leonard for reporting the issue.
(#1337; fixed #1336)
* A couple of invalid CIGAR strings have been corrected in the test data.
(#1343)
* The documentation for `samtools depth -s` has been improved. Thanks to
`@wulj2`. (#1355)
* Fixed a `samtools merge` segmentation fault when it failed to merge header
`@PG` records. Thanks to John Marshall. (#1394; reported by Kemin Zhou in
#1393)
* Ampliconclip and ampliconstats now check whether the BED file contains
more than one reference (chromosome) and fail if it does. Proper support
for multiple references will be added later. (#1398)
------------------------------------------------------------------------------
bcftools - changes v1.12
------------------------------------------------------------------------------
Changes affecting the whole of bcftools, or multiple commands:
* The output file type is determined from the output file name suffix, where
available, so the -O/--output-type option is often no longer necessary.
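A minimal Python sketch of suffix-based output type selection, as described in
the entry above. The suffix table, the case-insensitive match and the rule
that an explicit type wins are all assumptions made for this illustration; the
single-letter codes are the usual -O values (v and z for plain and compressed
VCF, b for compressed BCF).

```python
# Assumed mapping; longer suffixes are listed first so ".vcf.gz" wins.
SUFFIX_TO_TYPE = [(".vcf.gz", "z"), (".vcf", "v"), (".bcf", "b")]


def guess_output_type(filename, explicit=None):
    """Pick an output type from the file name unless one is given."""
    if explicit is not None:
        return explicit
    lower = filename.lower()
    for suffix, out_type in SUFFIX_TO_TYPE:
        if lower.endswith(suffix):
            return out_type
    return None  # no recognised suffix, so the caller must decide
```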
* Make F_MISSING in filtering expressions work for sites with multiple ALT
alleles (#1343)
* Fix N_PASS and F_PASS to behave according to expectation when reverse
logic is used (#1397). This fix has the side effect of `query` (or
programs like `+trio-stats`) behaving differently with these expressions,
operating now in site-oriented rather than sample-oriented mode. For
example, the new behavior could be:
bcftools query -f'[%POS %SAMPLE %GT\n]' -i'N_PASS(GT="alt")==1'
11 A 0/0
11 B 0/0
11 C 1/1
while previously the same expression would return:
11 C 1/1
The original mode can be mimicked by splitting the filtering into two steps:
bcftools view -i'N_PASS(GT="alt")==1' | \
bcftools query -f'[%POS %SAMPLE %GT\n]' -i'GT="alt"'
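The difference between the two modes can be reproduced with the three samples
from the example above. The Python sketch below is a simplified model, not
bcftools' filtering code: the function names are hypothetical, and treating a
genotype as "alt" when any allele is neither 0 nor missing is an
approximation of GT="alt".

```python
def is_alt(gt):
    """True when any allele in the genotype is non-reference."""
    alleles = gt.replace("|", "/").split("/")
    return any(a not in ("0", ".") for a in alleles)


def site_oriented(samples, n_required=1):
    """New behaviour: the site passes or fails as a whole."""
    n_pass = sum(is_alt(gt) for _, gt in samples)
    return list(samples) if n_pass == n_required else []


def two_step(samples, n_required=1):
    """Mimic the old output: filter the site, then filter the samples."""
    kept = site_oriented(samples, n_required)
    return [(name, gt) for name, gt in kept if is_alt(gt)]


# Samples at position 11, as in the example above.
site = [("A", "0/0"), ("B", "0/0"), ("C", "1/1")]
```

Here `site_oriented(site)` returns all three samples, while `two_step(site)`
returns only sample C, matching the two outputs shown above.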
Changes affecting specific commands:
* bcftools annotate:
- New `--rename-annots` option to help fix broken VCFs (#1335)
- New -C option allows a long list of options to be read from a file, to
prevent very long command lines.
- New `append-missing` logic allows annotations to be added for each ALT
allele in the same order as they appear in the VCF. Note that this is
not bulletproof. In order for this to work:
- the annotation file must have one line per ALT allele
- fields must contain a single value as multiple values are appended
as they are and would break the correspondence between the alleles
and values
* bcftools concat:
- With `-l`, do not phase genotypes by mistake if they are not already
phased (#1346)
* bcftools consensus:
- New `--mask-with`, `--mark-del`, `--mark-ins`, `--mark-snv` options
(#1382, #1381, #1170)
- Symbolic <DEL> should have only one REF base. If there are multiple,
take POS+1 as the first deleted base.
- Make consensus work when the first base of the reference genome is
deleted. In this situation the VCF record has POS=1 and the first REF
base cannot precede the event. (#1330)
* bcftools +contrast:
- The NOVELGT annotation was previously not added when requested.
* bcftools convert:
- Make the --hapsample and --hapsample2vcf options consistent with each
other and with the documentation.
* bcftools call:
- Revamp of `call -G`. Previously, sample grouping by population was not
truly independent and could still be influenced by the presence of
other sample groups.
- Optional addition of INFO/PV4 annotation with `call -a INFO/PV4`
- Remove generation of useless HOB and ICB annotation; use
`+fill-tags -- -t HWE,ExcHet` instead
- The `call -f` option was renamed to `-a` to (1) make it consistent with
`mpileup` and (2) to indicate that it includes both INFO and FORMAT
annotations, not just FORMAT as previously
- Any sensible Number=R,Type=Integer annotation can be used with -G, such
as AD or QS
- Don't trim QUAL; although usefulness of this change is
questionable for true probabilistic interpretation (such high
precision is unrealistic), using QUAL as a score rather than
probability is helpful and permits more fine-grained filtering
- Fix a suspected bug in `call -F`; in the worst case, the change at least
improves readability
- `call -C trio` is temporarily disabled
* bcftools csq:
- Fix a bug which caused incorrect FORMAT/BCSQ formatting at sites with
too many per-sample consequences
- Fix a bug which incorrectly handled the --ncsq parameter and could
clash with reserved BCF values, consequently producing truncated or
even incorrect output of the %TBCSQ formatting expression in `bcftools
query`. To account for the reserved values, the new default value is
--ncsq 15 (#1428)
* bcftools +fill-tags:
- MAF definition revised for multiallelic sites: the second most common
allele is now considered to be the minor allele (#1313)
- New FORMAT/VAF, VAF1 annotations to set the fraction of alternate reads
provided FORMAT/AD is present
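The two +fill-tags entries above can be illustrated with a short Python
sketch. The MAF function follows the revised definition stated above. The
VAF/VAF1 split (per ALT allele versus all ALT alleles combined) is an
assumption about how the two annotations differ, and all function names are
hypothetical.

```python
def minor_allele_frequency(allele_counts):
    """Frequency of the second most common allele (REF and ALTs)."""
    total = sum(allele_counts)
    if total == 0 or len(allele_counts) < 2:
        return 0.0
    second = sorted(allele_counts, reverse=True)[1]
    return second / total


def vaf(ad):
    """Assumed FORMAT/VAF: fraction of reads for each ALT allele."""
    total = sum(ad)
    return [a / total if total else None for a in ad[1:]]


def vaf1(ad):
    """Assumed FORMAT/VAF1: fraction of reads for all ALT alleles."""
    total = sum(ad)
    return sum(ad[1:]) / total if total else None
```

For allele counts of 60, 30 and 10, the minor allele is the one seen 30 times,
so the MAF is 30/100, not 10/100.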
* bcftools gtcheck:
- support matching of a single sample against all other samples in the
file with `-s qry:sample -s gt:-`. This was previously not possible,
either full cross-check mode had to be run or a list of pairs/samples
had to be created explicitly
* bcftools merge:
- Make `merge -R` behavior consistent with other commands and pull in
overlapping records with POS outside of the regions (#1374)
- Bug fix (#1353)
* bcftools mpileup:
- Add new optional tag `mpileup -a FORMAT/QS`
* bcftools norm:
- New `-a, --atomize` functionality to decompose complex variants, for
example MNVs into consecutive SNVs
- New option `--old-rec-tag` to indicate the original variant
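As a sketch of what decomposing an MNV into consecutive SNVs means, the Python
function below splits a record with equal-length REF and ALT into per-base
substitutions. It is a simplified model with a hypothetical name, not the
bcftools implementation; skipping positions where REF and ALT agree is a
choice made for this sketch, and other complex variants are not handled.

```python
def atomize_mnv(pos, ref, alt):
    """Split an MNV with equal-length REF and ALT into per-base SNVs."""
    if len(ref) != len(alt):
        raise ValueError("this sketch only handles equal-length REF/ALT")
    return [
        (pos + i, r, a)
        for i, (r, a) in enumerate(zip(ref, alt))
        if r != a  # sketch choice: bases that match produce no record
    ]
```

For example, an MNV at POS=100 with REF=AC and ALT=GT becomes an A>G SNV at
position 100 and a C>T SNV at position 101.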
* bcftools query:
- Incorrect fields were printed in the per-sample output when a subset of
samples was requested via -s/-S and the order of samples in the header
was different from the requested -s/-S order (#1435)
* bcftools +prune:
- New options --random-seed and --nsites-per-win-mode (#1050)
* bcftools +split-vep:
- Transcript selection now works also on the raw CSQ/BCSQ annotation.
- Bug fix, samples were dropped on VCF input and VCF/BCF output (#1349)
* bcftools stats:
- Changes to QUAL and ts/tv plotting stats: avoid capping QUAL to
predefined bins, use an open-range logarithmic binning instead
- plot dual ts/tv stats: per quality bin and cumulative as if threshold
applied on the whole dataset
* bcftools +trio-dnm2:
- Major revamp of +trio-dnm plugin, which is now deprecated and replaced
by +trio-dnm2.
The original trio-dnm calling model used genotype likelihoods (PLs) as
the input for calling. However, that is flawed because PLs make
assumptions which are unsuitable for de novo calling: PL(RR) can become
bigger than PL(RA) even when the ALT allele is present in the parents.
Note that this is true also for other programs such as DeNovoGear which
rely on the same samtools calculation.
The new recommended workflow is:
bcftools mpileup -a AD,QS -f ref.fa -Ou \
proband.bam father.bam mother.bam | \
bcftools call -mv -Ou | \
bcftools +trio-dnm2 -p proband,father,mother -Oz -o output.vcf.gz
This new version also implements the DeNovoGear model. The original
behavior of trio-dnm is no longer supported.
For more details see http://samtools.github.io/bcftools/trio-dnm.pdf