Re: [Samtools-help] Bug in tabix: ignores 0-based (-0) option.

Heredia Genestar, Txema Tue, 17 Apr 2018 06:41:52 -0700

After further reading the documentation I've finally found the part where
it explains the -0 option.


I was mistaken and the -0 option is not there if you fancy one or another
indexing. It is there for the ¿rare? cases where your vcf file data is
written using 0-based positions and you need proper indexing.

My apologies.

Txema

2018-04-17 14:00 GMT+02:00 Heredia Genestar, Txema <txema.here...@upf.edu>:

> Woah, that's worse than I thought then!
>
> The -0 option is functioning as designed. Like the other options
>> specifying the columns of interest within the data file to be indexed (-b,
>> -e, -p, etc), -0 is only effective when tabix is being used to create an
>> index for a data file, not when querying via an existing index.
>>
>> It's fairly clear in the documentation that -0 doesn't affect the
>> interpretation of *query* coordinates, whether in a -R file or as
>> command-line arguments.
>>
>
> I was coming from using tabix 0.2.5, where -0 is thrown in in the middle
> of all options, with no mention to indexing only, so I supposed it applied
> to your queries and not indexing. I began using newer versions (HTSLIB 1.5)
> because old stand-alone tabix versions have a bug when combining -h and -B
> options, where the first variant of the vcf is not returned
>
> $ tabix -h ../chimp/vcf/vcf/chimp.chr22.filtered_chimp_gorilla.vcf.gz
> chr22:16856866-16856867 | awk -v OFS="\t" '{print $1,$2,$3,$4,$5}' | grep
> -v ^##
> #CHROM POS ID REF ALT
> chr22 16856867 . T C
>
> vs
>
> $ tabix -h ../chimp/vcf/vcf/chimp.chr22.filtered_chimp_gorilla.vcf.gz -B
> <(echo -e "chr22\t16856866\t16856867") | awk -v OFS="\t" '{print
> $1,$2,$3,$4,$5}' | grep -v ^##
> #CHROM POS ID REF ALT
>
>
> Here -0 is ignored as it is an indexing option (so as tabix currently
>> stands, it would be good if trying to use -0 or other indexing-specific
>> options during querying produced an error message).
>>
>
> I definitely agree.
>
> However if you store your -R coordinates in a file named test.bed, then
>> "tabix -R test.bed vcf" will work as you expect and return two lines. This
>> is as documented, as -R interprets the file's contents as BED or as 1-based
>> TAB-delimited depending on the filename extension. But there's no way to
>> effectively control the filename (if there even is any!) when you're using
>> <() or pipes or etc.
>>
>> This is a pretty good demonstration of why having tools decide how to
>> interpret ambiguous [1] file contents based solely on filename extensions
>> is unwise. :-)
>>
>
> That would explain why I encountered this problem differently in 2 vcf
> files. I was filtering a common set of variants from 2 vcf files. One had
> chromosome 1, 2, 3,.. and the other had chr1, chr2, chr3,... For the first
> file I used "tabix ... -R bed" while for the second I used "tabix ... -R
> <(awk '{print "chr"$1,$2,$3}' )"...
>
>
>>
>> So for your use case, it would be good if tabix querying had an option to
>> force interpretation of the -R file as BED. (Or if BED were the default
>> here, but that ship has sadly sailed.) It would probably be least confusing
>> if the name for that option was also -0. So the awkwardness you've
>> encountered could be fixed by having tabix also use -0 as a querying option:
>>
>>   -0, --zero-based, --bed   treat -R (and -T?) file as a BED file
>>
>>
>> OTOH throughout samtools and bcftools this SEQ:START-END way of
>> specifying a range of coordinates is always used as a human-readable
>> textual range, so it would be confusing to interpret it other than as a
>> 1-based inclusive range. So I would not recommend having -0 affect the
>> interpretation of this human-readable syntax.
>>
>
> Agreed.
>
> My main issue is that the behavior changed from old to newer versions:
>
> Having this bed file:
>
> $ cat test.bed
> chr22 16856866 16856867   ## first variant in vcf
> chr22 17491114 17491115
> chr22 17491115 17491116
>
> And the very same vcf.gz file indexed WITHOUT -0 (tabix -p vcf file.vcf.gz)
>
> Old tabix versions  (tabix 0.2.5 (r1005) ):
>
> No -0, bed file:
> $ tabix vcf -B test.bed | awk -v OFS="\t" '{print $1,$2,$3,$4,$5}'
> chr22 16856867 . T C
> chr22 17491115 . C T
> chr22 17491116 . G A
>
> -0, bed file:
> $ tabix vcf -B test.bed | awk -v OFS="\t" '{print $1,$2,$3,$4,$5}'
> chr22 16856867 . T C
> chr22 17491115 . C T
> chr22 17491116 . G A
>
> No -0, redirected/manipulated file:
> $ tabix vcf -B <(cat test.bed) | awk -v OFS="\t" '{print $1,$2,$3,$4,$5}'
> chr22 16856867 . T C
> chr22 17491115 . C T
> chr22 17491116 . G A
>
> -0, redirected/manipulated file:
> $ tabix -0 vcf -B <(cat test.bed) | awk -v OFS="\t" '{print
> $1,$2,$3,$4,$5}'
> chr22 16856867 . T C
> chr22 17491115 . C T
> chr22 17491116 . G A
>
> -h, bed file:
> $ tabix -h vcf -B test.bed | awk -v OFS="\t" '{print $1,$2,$3,$4,$5}' |
> grep -v ^#
> chr22 17491115 . C T
> chr22 17491116 . G A
>
> The file is indexed 1-based, but the -B option treats whatever is passed
> as a 0-based BED file.
>
>
> New versions (tabix 1.8):
>
> No -0, bed file:
> $ tabix vcf -R test.bed | awk -v OFS="\t" '{print $1,$2,$3,$4,$5}'
> chr22 16856867 . T C
> chr22 17491115 . C T
> chr22 17491116 . G A
>
> -0, bed file:
> $ tabix -0 vcf -R test.bed | awk -v OFS="\t" '{print $1,$2,$3,$4,$5}'
> chr22 16856867 . T C
> chr22 17491115 . C T
> chr22 17491116 . G A
>
> No -0, redirected/manipulated file:
> $ tabix vcf -R <(cat test.bed) | awk -v OFS="\t" '{print $1,$2,$3,$4,$5}'
> chr22 16856867 . T C
> chr22 17491115 . C T
> chr22 17491115 . C T
> chr22 17491116 . G A
>
> -0, redirected/manipulated file:
> $ tabix -0 vcf -R <(cat test.bed) | awk -v OFS="\t" '{print
> $1,$2,$3,$4,$5}'
> chr22 16856867 . T C
> chr22 17491115 . C T
> chr22 17491115 . C T
> chr22 17491116 . G A
>
> -h, bed file:
> $ tabix -h vcf -R test.bed | awk -v OFS="\t" '{print $1,$2,$3,$4,$5}'|
> grep -v ^##
> #CHROM POS ID REF ALT
> chr22 16856867 . T C
> chr22 17491115 . C T
> chr22 17491116 . G A
>
>
> All this rant and issue is probably my fault for not reading the
> documentation of the newer versions and thinking the new -R option was just
> a letter change from the old -B version.
>
> There is something about this -0 design decision that I don't fully
> understand. Why is -0 tied to the index and not the filter? In which
> scenario/use case you have two different regions files (one 0-based bed
> file and one 1-based region file) so huge you cannot edit them on the fly
> and you create two different versions of each VCF file, one 0-base-indexed
> and the other 1-base-indexed? I work with big VCF files (1+ Gb) where I
> apply different filters using different bed files to get several flavors of
> regions/genes/whatever. It makes no sense to me to have to rebuild the
> index file of all the vcfs depending on what region file I am using. It is
> even impossible to do if I plan to run simultaneous jobs in a cluster
> filtering using different files.
>
> As it stands now, I cannot use tabix and I have to parse those big vcfs
> with awk. Is there any version without the -h bug that still has the -B
> option?
>
>
> Thanks for your detailed answer
>
> Txema
>

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot

_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Re: [Samtools-help] Bug in tabix: ignores 0-based (-0) option.

Reply via email to