On 13 May 2015, at 12:17, Kanterakis, Efstathios <ekantera...@illumina.com> 
wrote:
> bgzip chr1_h.vcf
> bgzip chr2.vcf
> cat chr1_h.vcf.gz chr2.vcf.gz > test.vcf.gz

...i.e., constructs test.vcf.gz with many BGZF blocks, including an EOF trailer 
block from each of chr1_h.vcf.gz (in the middle of test.vcf.gz) and chr2.vcf.gz 
(at the end of test.vcf.gz).
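
To make that layout concrete, here is a minimal Python sketch (not part of the original exchange). It builds two tiny BGZF-style files in memory, concatenates them, and lists where the EOF trailer blocks end up. The helper names bgzf_block, iter_blocks and eof_block_offsets are made up for illustration, and the walker assumes every block carries only the 6-byte "BC" extra subfield, as bgzip output does.

```python
import gzip
import struct
import zlib

# The 28-byte empty BGZF block that bgzip appends as an EOF marker.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00"
    " 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def bgzf_block(data: bytes) -> bytes:
    """Wrap data in a single BGZF block (hypothetical helper)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate stream
    cdata = comp.compress(data) + comp.flush()
    bsize = 18 + len(cdata) + 8 - 1  # total block size minus one
    header = struct.pack("<BBBBIBBH", 31, 139, 8, 4, 0, 0, 255, 6)
    extra = struct.pack("<BBHH", 66, 67, 2, bsize)  # 'B', 'C', SLEN, BSIZE
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data))
    return header + extra + cdata + trailer


def iter_blocks(buf: bytes):
    """Yield (offset, block) pairs, assuming BSIZE sits at byte 16 of each block."""
    offset = 0
    while offset < len(buf):
        bsize = struct.unpack_from("<H", buf, offset + 16)[0] + 1
        yield offset, buf[offset:offset + bsize]
        offset += bsize


def eof_block_offsets(buf: bytes):
    """Return the offsets of every EOF marker block in buf."""
    return [off for off, block in iter_blocks(buf) if block == BGZF_EOF]


# Two "bgzipped" files, each ending in its own EOF block, then cat'ed together.
chr1 = bgzf_block(b"chr1\t100\t.\tA\tC\n") + BGZF_EOF
chr2 = bgzf_block(b"chr2\t200\t.\tG\tT\n") + BGZF_EOF
combined = chr1 + chr2

print(eof_block_offsets(combined))  # one offset mid-file, one at the end
payload = gzip.decompress(combined)  # a multi-member gzip reader sees both records
```

The first reported offset is where chr1's data ends, which is exactly the block a reader must not mistake for the end of the whole file.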

> tabix test.vcf.gz    # <--
> tabix test.vcf.gz chr2 # blank
> tabix test.vcf.gz chr1 # works
> [...]
> I was under the impression that bgzipped files are directly cat'able. Is this 
> a bug?

As Len suspected, the tabix index command (marked <--) is stopping at the EOF 
trailer block at the end of chr1_h.vcf.gz.  This is an htslib bug: 
https://github.com/samtools/htslib/issues/45 .
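
The failure mode can be imitated in a few lines of Python. This is only an illustrative sketch, not htslib's actual code: read_members is a hypothetical function, plain gzip members stand in for BGZF blocks, and an empty member stands in for the EOF trailer block.

```python
import gzip
import zlib


def read_members(buf: bytes, stop_at_empty: bool) -> bytes:
    """Decompress gzip members one at a time.

    With stop_at_empty=True, an empty member is treated as end of file,
    which imitates the behaviour described above.
    """
    out = []
    while buf:
        d = zlib.decompressobj(31)  # 31 = expect a gzip header and trailer
        data = d.decompress(buf) + d.flush()
        buf = d.unused_data  # bytes belonging to the following members
        if not data and stop_at_empty:
            break
        out.append(data)
    return b"".join(out)


# Each "file" is one data member followed by an empty EOF-like member.
chr1 = gzip.compress(b"chr1\t100\n") + gzip.compress(b"")
chr2 = gzip.compress(b"chr2\t200\n") + gzip.compress(b"")
combined = chr1 + chr2

truncated = read_members(combined, stop_at_empty=True)   # chr1 only
complete = read_members(combined, stop_at_empty=False)   # chr1 and chr2
```

An index built from the truncated view would contain no chr2 records, which matches the blank "tabix test.vcf.gz chr2" query reported above.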

See http://sourceforge.net/p/samtools/mailman/message/33493929/ for further 
background.  Nobody considered these EOF blocks and concatenation of bgzipped 
files until rather late in the piece, and both htslib/samtools and 
htsjdk/Picard still have bugs that mean they stop reading at these EOF blocks 
in various circumstances.  The fact that this doesn't cause chaos shows how 
rarely bgzipped files are concatenated in practice, and is a large part of the 
reason why these bugs have not been fixed.

Thanks for the IMHO rather plausible use case!  I mostly fixed this in htslib a 
while back, but stopped as the expected utility did not seem to outweigh the 
risk of screwing up error handling in the code in question.  Plausible use 
cases change that calculus.

On 13 May 2015, at 13:52, Peter Cock <p.j.a.c...@googlemail.com> wrote:
> Second, some tools fail
> to cope with concatenated gzip blocks (e.g. some Java
> libraries break).

This is a separate concern and is not in play here.  Any sizeable bgzipped file 
is already itself a bunch of concatenated gzip/BGZF blocks, so catting two of 
them together makes no difference to the Java library problem.
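
A small Python sketch of that point, again using plain gzip members as a stand-in for BGZF blocks; first_member_only is a hypothetical reader that mimics a library which stops after the first gzip member.

```python
import gzip
import zlib


def first_member_only(buf: bytes) -> bytes:
    """Decompress just the first gzip member and ignore whatever follows."""
    d = zlib.decompressobj(31)  # 31 = expect a gzip header and trailer
    return d.decompress(buf) + d.flush()


# A single "bgzipped" file is already several members back to back.
single = b"".join(gzip.compress(c) for c in (b"aaa\n", b"bbb\n", b"ccc\n"))
other = gzip.compress(b"ddd\n")

naive_single = first_member_only(single)        # already truncated
naive_cat = first_member_only(single + other)   # truncated in the same way
full_cat = gzip.decompress(single + other)      # multi-member aware reader
```

The naive reader loses data on the single file already, so concatenating a second file does not introduce a new problem for it.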

    John

-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help
