On 16 Apr 2018, at 15:55, Heredia Genestar, Txema <txema.here...@upf.edu> wrote: > HTSLIB 1.5 and 1.8 both have a bug in tabix. It ignores the option for > 0-based input files.
You are right that there is awkwardness here, but the full story is rather more involved. The -0 option is functioning as designed. Like the other options specifying the columns of interest within the data file to be indexed (-b, -e, -p, etc), -0 is only effective when tabix is being used to create an index for a data file, not when querying via an existing index. It's fairly clear in the documentation that -0 doesn't affect the interpretation of *query* coordinates, whether in a -R file or as command-line arguments. > Using tabix -0 -R returns 3 entries. One snp (17491115) for the first bed > line (17491114-17491115) and two (17491115 & 17491116) for the second bed > line: > > $tabix -0 vcf -R <(echo -e > "chr22\t17491114\t17491115\nchr22\t17491115\t17491116") | awk -v OFS="\t" > '{print $1,$2,$3,$4,$5,$6,$7}' > chr22 17491115 . C T 1189.33 PASS > chr22 17491115 . C T 1189.33 PASS > chr22 17491116 . G A 16942.9 PASS Here -0 is ignored as it is an indexing option (so as tabix currently stands, it would be good if trying to use -0 or other indexing-specific options during querying produced an error message). However if you store your -R coordinates in a file named test.bed, then "tabix -R test.bed vcf" will work as you expect and return two lines. This is as documented, as -R interprets the file's contents as BED or as 1-based TAB-delimited depending on the filename extension. But there's no way to effectively control the filename (if there even is any!) when you're using <() or pipes or etc. This is a pretty good demonstration of why having tools decide how to interpret ambiguous [1] file contents based solely on filename extensions is unwise. :-) So for your use case, it would be good if tabix querying had an option to force interpretation of the -R file as BED. (Or if BED were the default here, but that ship has sadly sailed.) It would probably be least confusing if the name for that option was also -0. So the awkwardness you've encountered could be fixed by having tabix also use -0 as a querying option: -0, --zero-based, --bed treat -R (and -T?) file as a BED file Thus, as an addition feature, tabix would also understand -0 when querying with a -R file. > The second position returns two variants with/without -0, even when giving > the positions as string: > > $tabix -0 vcf chr22:17491115-17491116 | awk -v OFS="\t" '{print > $1,$2,$3,$4,$5,$6,$7}' > chr22 17491115 . C T 1189.33 PASS > chr22 17491116 . G A 16942.9 PASS > > $ tabix vcf chr22:17491115-17491116 | awk -v OFS="\t" '{print > $1,$2,$3,$4,$5,$6,$7}' > chr22 17491115 . C T 1189.33 PASS > chr22 17491116 . G A 16942.9 PASS OTOH throughout samtools and bcftools this SEQ:START-END way of specifying a range of coordinates is always used as a human-readable textual range, so it would be confusing to interpret it other than as a 1-based inclusive range. So I would not recommend having -0 affect the interpretation of this human-readable syntax. John [1] BED and "1-based TAB-delimited" look identical. ------------------------------------------------------------------------------ Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot _______________________________________________ Samtools-help mailing list Samtools-help@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/samtools-help