On 16 Apr 2018, at 15:55, Heredia Genestar, Txema <txema.here...@upf.edu> wrote:
> HTSLIB 1.5 and 1.8 both have a bug in tabix. It ignores the option for 
> 0-based input files.

You are right that there is awkwardness here, but the full story is rather more 
involved.

The -0 option is functioning as designed. Like the other options specifying the 
columns of interest within the data file to be indexed (-b, -e, -p, etc.), -0 
only takes effect when tabix is being used to create an index for a data file, 
not when querying via an existing index.

It's fairly clear in the documentation that -0 doesn't affect the 
interpretation of *query* coordinates, whether in a -R file or as command-line 
arguments.

> Using tabix -0 -R returns 3 entries. One snp (17491115) for the first bed 
> line (17491114-17491115) and two (17491115 & 17491116) for the second bed 
> line:
> 
> $tabix -0 vcf -R <(echo -e 
> "chr22\t17491114\t17491115\nchr22\t17491115\t17491116") | awk -v OFS="\t" 
> '{print $1,$2,$3,$4,$5,$6,$7}'
> chr22 17491115        .       C       T       1189.33 PASS
> chr22 17491115        .       C       T       1189.33 PASS
> chr22 17491116        .       G       A       16942.9 PASS

Here -0 is ignored because it is an indexing option. (As tabix currently 
stands, it would be good if trying to use -0 or other indexing-specific options 
while querying produced an error message.)

However, if you store your -R coordinates in a file named test.bed, then "tabix 
-R test.bed vcf" will work as you expect and return two lines. This is as 
documented: -R interprets the file's contents as BED or as 1-based 
TAB-delimited depending on the filename extension. But there's no way to 
control the filename (if there even is one!) when you're using <(), pipes, 
etc.
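
In the meantime, one workaround that keeps the <() style is to convert the BED 
intervals to 1-based inclusive coordinates yourself before handing them to -R. 
A minimal sketch, assuming the regions are three-column TAB-delimited BED 
lines; bed_to_one_based is just an illustrative name, not something tabix 
provides:

```shell
# Convert BED intervals (0-based start, exclusive end) read from stdin
# into 1-based inclusive TAB-delimited intervals written to stdout.
# Only the start changes: BED [start, end) is 1-based [start+1, end].
# bed_to_one_based is a hypothetical helper name, not part of tabix.
bed_to_one_based() {
    awk 'BEGIN { FS = OFS = "\t" } { print $1, $2 + 1, $3 }'
}

# Possible usage (assumes tabix is installed and "vcf" is an indexed file):
#   tabix vcf -R <(printf 'chr22\t17491114\t17491115\n' | bed_to_one_based)
```

Since a -R file without a .bed extension is read as 1-based TAB-delimited, the 
converted intervals should then select the same records as the BED file would.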

This is a pretty good demonstration of why having tools decide how to interpret 
ambiguous [1] file contents based solely on filename extensions is unwise. :-)

So for your use case, it would be good if tabix querying had an option to force 
interpretation of the -R file as BED. (Or if BED were the default here, but 
that ship has sadly sailed.) It would probably be least confusing if the name 
for that option were also -0. So the awkwardness you've encountered could be 
fixed by having tabix also use -0 as a querying option:

  -0, --zero-based, --bed   treat -R (and -T?) file as a BED file

Thus, as an additional feature, tabix would also understand -0 when querying 
with a -R file.

> The second position returns two variants with/without -0, even when giving 
> the positions as string:
> 
> $tabix -0 vcf chr22:17491115-17491116 | awk -v OFS="\t" '{print 
> $1,$2,$3,$4,$5,$6,$7}'
> chr22 17491115        .       C       T       1189.33 PASS
> chr22 17491116        .       G       A       16942.9 PASS
> 
> $ tabix vcf chr22:17491115-17491116 | awk -v OFS="\t" '{print 
> $1,$2,$3,$4,$5,$6,$7}'
> chr22 17491115        .       C       T       1189.33 PASS
> chr22 17491116        .       G       A       16942.9 PASS

OTOH throughout samtools and bcftools this SEQ:START-END way of specifying a 
range of coordinates is always used as a human-readable textual range, so it 
would be confusing to interpret it other than as a 1-based inclusive range. So 
I would not recommend having -0 affect the interpretation of this 
human-readable syntax.
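
If you have a BED-style interval and want the equivalent human-readable region, 
the conversion is small enough to do by hand. A rough sketch (bed_to_region is 
a made-up helper name, used here only for illustration):

```shell
# bed_to_region CHROM START END
# Print the 1-based inclusive SEQ:START-END string covering the same bases
# as the BED interval CHROM START END (0-based start, exclusive end).
# bed_to_region is a hypothetical helper name, not part of tabix.
bed_to_region() {
    printf '%s:%d-%d\n' "$1" "$(( $2 + 1 ))" "$3"
}

# Possible usage (assumes tabix is installed and "vcf" is an indexed file):
#   tabix vcf "$(bed_to_region chr22 17491115 17491116)"
```

For the second line of your BED input this gives chr22:17491116-17491116, 
i.e. a single position.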

    John


[1] BED and "1-based TAB-delimited" look identical.

_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help