Woah, that's worse than I thought then!

The -0 option is functioning as designed. Like the other options specifying
> the columns of interest within the data file to be indexed (-b, -e, -p,
> etc), -0 is only effective when tabix is being used to create an index for
> a data file, not when querying via an existing index.
>
> It's fairly clear in the documentation that -0 doesn't affect the
> interpretation of *query* coordinates, whether in a -R file or as
> command-line arguments.
>

I was coming from using tabix 0.2.5, where -0 is thrown in in the middle of
all options, with no mention to indexing only, so I supposed it applied to
your queries and not indexing. I began using newer versions (HTSLIB 1.5)
because old stand-alone tabix versions have a bug when combining -h and -B
options, where the first variant of the vcf is not returned

$ tabix -h ../chimp/vcf/vcf/chimp.chr22.filtered_chimp_gorilla.vcf.gz
chr22:16856866-16856867 | awk -v OFS="\t" '{print $1,$2,$3,$4,$5}' | grep
-v ^##
#CHROM POS ID REF ALT
chr22 16856867 . T C

vs

$ tabix -h ../chimp/vcf/vcf/chimp.chr22.filtered_chimp_gorilla.vcf.gz -B
<(echo -e "chr22\t16856866\t16856867") | awk -v OFS="\t" '{print
$1,$2,$3,$4,$5}' | grep -v ^##
#CHROM POS ID REF ALT


Here -0 is ignored as it is an indexing option (so as tabix currently
> stands, it would be good if trying to use -0 or other indexing-specific
> options during querying produced an error message).
>

I definitely agree.

However if you store your -R coordinates in a file named test.bed, then
> "tabix -R test.bed vcf" will work as you expect and return two lines. This
> is as documented, as -R interprets the file's contents as BED or as 1-based
> TAB-delimited depending on the filename extension. But there's no way to
> effectively control the filename (if there even is any!) when you're using
> <() or pipes or etc.
>
> This is a pretty good demonstration of why having tools decide how to
> interpret ambiguous [1] file contents based solely on filename extensions
> is unwise. :-)
>

That would explain why I encountered this problem differently in 2 vcf
files. I was filtering a common set of variants from 2 vcf files. One had
chromosome 1, 2, 3,.. and the other had chr1, chr2, chr3,... For the first
file I used "tabix ... -R bed" while for the second I used "tabix ... -R
<(awk '{print "chr"$1,$2,$3}' )"...


>
> So for your use case, it would be good if tabix querying had an option to
> force interpretation of the -R file as BED. (Or if BED were the default
> here, but that ship has sadly sailed.) It would probably be least confusing
> if the name for that option was also -0. So the awkwardness you've
> encountered could be fixed by having tabix also use -0 as a querying option:
>
>   -0, --zero-based, --bed   treat -R (and -T?) file as a BED file
>
>
> OTOH throughout samtools and bcftools this SEQ:START-END way of specifying
> a range of coordinates is always used as a human-readable textual range, so
> it would be confusing to interpret it other than as a 1-based inclusive
> range. So I would not recommend having -0 affect the interpretation of this
> human-readable syntax.
>

Agreed.

My main issue is that the behavior changed from old to newer versions:

Having this bed file:

$ cat test.bed
chr22 16856866 16856867   ## first variant in vcf
chr22 17491114 17491115
chr22 17491115 17491116

And the very same vcf.gz file indexed WITHOUT -0 (tabix -p vcf file.vcf.gz)

Old tabix versions  (tabix 0.2.5 (r1005) ):

No -0, bed file:
$ tabix vcf -B test.bed | awk -v OFS="\t" '{print $1,$2,$3,$4,$5}'
chr22 16856867 . T C
chr22 17491115 . C T
chr22 17491116 . G A

-0, bed file:
$ tabix vcf -B test.bed | awk -v OFS="\t" '{print $1,$2,$3,$4,$5}'
chr22 16856867 . T C
chr22 17491115 . C T
chr22 17491116 . G A

No -0, redirected/manipulated file:
$ tabix vcf -B <(cat test.bed) | awk -v OFS="\t" '{print $1,$2,$3,$4,$5}'
chr22 16856867 . T C
chr22 17491115 . C T
chr22 17491116 . G A

-0, redirected/manipulated file:
$ tabix -0 vcf -B <(cat test.bed) | awk -v OFS="\t" '{print $1,$2,$3,$4,$5}'
chr22 16856867 . T C
chr22 17491115 . C T
chr22 17491116 . G A

-h, bed file:
$ tabix -h vcf -B test.bed | awk -v OFS="\t" '{print $1,$2,$3,$4,$5}' |
grep -v ^#
chr22 17491115 . C T
chr22 17491116 . G A

The file is indexed 1-based, but the -B option treats whatever is passed as
a 0-based BED file.


New versions (tabix 1.8):

No -0, bed file:
$ tabix vcf -R test.bed | awk -v OFS="\t" '{print $1,$2,$3,$4,$5}'
chr22 16856867 . T C
chr22 17491115 . C T
chr22 17491116 . G A

-0, bed file:
$ tabix -0 vcf -R test.bed | awk -v OFS="\t" '{print $1,$2,$3,$4,$5}'
chr22 16856867 . T C
chr22 17491115 . C T
chr22 17491116 . G A

No -0, redirected/manipulated file:
$ tabix vcf -R <(cat test.bed) | awk -v OFS="\t" '{print $1,$2,$3,$4,$5}'
chr22 16856867 . T C
chr22 17491115 . C T
chr22 17491115 . C T
chr22 17491116 . G A

-0, redirected/manipulated file:
$ tabix -0 vcf -R <(cat test.bed) | awk -v OFS="\t" '{print $1,$2,$3,$4,$5}'
chr22 16856867 . T C
chr22 17491115 . C T
chr22 17491115 . C T
chr22 17491116 . G A

-h, bed file:
$ tabix -h vcf -R test.bed | awk -v OFS="\t" '{print $1,$2,$3,$4,$5}'| grep
-v ^##
#CHROM POS ID REF ALT
chr22 16856867 . T C
chr22 17491115 . C T
chr22 17491116 . G A


All this rant and issue is probably my fault for not reading the
documentation of the newer versions and thinking the new -R option was just
a letter change from the old -B version.

There is something about this -0 design decision that I don't fully
understand. Why is -0 tied to the index and not the filter? In which
scenario/use case you have two different regions files (one 0-based bed
file and one 1-based region file) so huge you cannot edit them on the fly
and you create two different versions of each VCF file, one 0-base-indexed
and the other 1-base-indexed? I work with big VCF files (1+ Gb) where I
apply different filters using different bed files to get several flavors of
regions/genes/whatever. It makes no sense to me to have to rebuild the
index file of all the vcfs depending on what region file I am using. It is
even impossible to do if I plan to run simultaneous jobs in a cluster
filtering using different files.

As it stands now, I cannot use tabix and I have to parse those big vcfs
with awk. Is there any version without the -h bug that still has the -B
option?


Thanks for your detailed answer

Txema
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to