On 2013-06-23 Richard W.M. Jones wrote:
> I'm trying to write an NBD driver for XZ files.  This requires random
> access to the files.
> 
> So far I have loaded the index from the file, and I'm using
> lzma_index_iter_locate (successfully) to locate the block and
> uncompressed offset that contains the byte of interest.  However I'm
> stuck as to where I go from there.
> 
> I am able to decode the block header using lzma_block_header_decode.
> But should I need to do that?  Isn't the block already "loaded" in the
> index?

The index contains only the sizes of the blocks, so yes, you need to
decode the block header. In the liblzma API, block header decoding is
separate from the Block decoder code.

The block header contains information needed to decode the compressed
data in the block. The block header may also contain the same size
information as the index, so it's recommended to check that the index
and block header don't contradict each other.

To the question "XXX Any reason we are doing this?" in xzfile.c: If the
compressed size was stored in the block header, calling
lzma_block_compressed_size() validates that it matches the unpadded
size in the index. If the compressed size wasn't stored in the block
header, block.compressed_size is set from iter.block.unpadded_size. In
both cases the block decoder will then check that the compressed size
really is what the headers say it should be. If
lzma_block_compressed_size() fails, something is wrong with the block
header or the index.
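To make the arithmetic concrete, here is a hypothetical plain-C sketch
(function name made up) of the size bookkeeping behind
lzma_block_compressed_size(): in the .xz format, Unpadded Size =
Block Header Size + Compressed Size + Check Size, so the compressed
size implied by an index entry falls out by subtraction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration only: in the .xz format,
 *   Unpadded Size = Block Header Size + Compressed Size + Check Size,
 * so the compressed size implied by an index entry can be recovered
 * by subtracting the other two fields. */
static uint64_t
compressed_size_from_unpadded(uint64_t unpadded_size,
                              uint64_t header_size,
                              uint64_t check_size)
{
    /* The sum of the other fields must leave room for at least
     * one byte of compressed data. */
    assert(unpadded_size > header_size + check_size);
    return unpadded_size - header_size - check_size;
}
```

The real liblzma function additionally validates VLI limits and, when
the block header already stored a compressed size, checks that both
sources agree.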

I noticed that the equivalent check for block.uncompressed_size was
missing from list.c. I have now added it:

    http://git.tukaani.org/?p=xz.git;a=commitdiff;h=ebb501ec73cecc546c67117dd01b5e33c90bfb4a

Anyway, the point of these checks is just to be pedantic and detect
corrupt files as well as possible. They aren't strictly required.

> I'm also able to read the data from the block (although decoding fails
> at the end of the block -- I don't understand why).

There is a hardcoded value:

    block.check = LZMA_CHECK_NONE;

The value should be taken from iter.stream.flags->check
instead. But to get it from there, it needs to be stored first with
lzma_index_stream_flags(). Look for the call to that function in list.c.

With a hardcoded LZMA_CHECK_NONE you get an error if there is an
integrity check, because the observed block size doesn't match what has
been stored in the block header (or in the index).

Since you appear to want to support .xz files with more than one stream
(that is good), you also need to use lzma_index_stream_padding().
Again, see list.c. If there is padding but its size isn't stored in the
index, seeking will fail because the compressed offsets will be
miscalculated.

A few things I noticed while reading the code:

xzfile.c line 171:

    index_size = footer_flags.backward_size;

index_size is off_t. If someone for some weird reason builds the code
with a 32-bit off_t, there is an integer overflow if backward_size (i.e.
the size of the index) doesn't fit into 32 bits. It is very unlikely in
practice, but the theoretical maximum value of backward_size is 2^34,
i.e. LZMA_BACKWARD_SIZE_MAX.
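A defensive option is to range-check before assigning. A hypothetical
helper (the name and the INT32_MAX bound simulating a 32-bit off_t are
assumptions, not code from xzfile.c):

```c
#include <stdint.h>

/* Hypothetical helper: reject an index size that cannot be
 * represented in a 32-bit signed off_t before assigning it.
 * backward_size may legally be as large as 2^34
 * (LZMA_BACKWARD_SIZE_MAX), which is why the check is needed. */
static int
index_size_fits_off32(uint64_t backward_size)
{
    return backward_size <= INT32_MAX;
}
```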

xzfile.c line 196:

    strm.avail_in = index_size;
    if (strm.avail_in > BUFSIZ)
      strm.avail_in = BUFSIZ;

If index_size is greater than UINT32_MAX (unlikely) and the code runs on
a 32-bit system where size_t is 32 bits, the above causes trouble if the
lowest 32 bits are all zero (even more unlikely but still possible).
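To illustrate: if the clamp is done after the value has already been
narrowed to a 32-bit size_t, an index_size of exactly 2^32 becomes
zero and the loop never reads anything. Clamping in 64 bits first
avoids that. A sketch using uint32_t to simulate a 32-bit size_t
(MY_BUFSIZ stands in for BUFSIZ; both helpers are made-up names):

```c
#include <stdint.h>

#define MY_BUFSIZ 8192  /* stand-in for BUFSIZ */

/* Buggy order: narrowing first truncates 2^32 to 0, so the
 * clamp never triggers and avail_in stays 0. */
static uint32_t
buggy_avail_in(uint64_t index_size)
{
    uint32_t avail = (uint32_t)index_size;  /* truncates first: bug */
    if (avail > MY_BUFSIZ)
        avail = MY_BUFSIZ;
    return avail;
}

/* Safe order: clamp while still in 64 bits, then narrow. */
static uint32_t
safe_avail_in(uint64_t index_size)
{
    return (uint32_t)(index_size < MY_BUFSIZ ? index_size : MY_BUFSIZ);
}
```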

xzfile.c line 211:

    if (r != LZMA_STREAM_END) {
      nbdkit_error ("%s: could not parse index (error %d)",
                    filename, r);

The above should check that the size of the index matches what was
stored in the stream footer. If there's a mismatch, the file is
corrupt. It is good to catch it here so that it doesn't cause
trouble later. Here is a possible fix:

    if (r != LZMA_STREAM_END || index_size != 0 || strm.avail_in != 0) {

xzfile.c lines 389-391, 410-412:

    while (strm.total_out < discard_bytes) {
      uint8_t buf[BUFSIZ];
      uint8_t discard[BUFSIZ * 10];
    ...
    strm.avail_out = sizeof discard;
    strm.next_out = discard;
    r = lzma_code (&strm, LZMA_RUN);

Maybe this loop isn't even meant to be finished yet, but just in case:
the above makes the incorrect assumption that lzma_code() will always
consume the whole input buffer. The library is free to buffer as much
data as it wants, so it may sometimes produce a lot of output while
consuming little or even no input.

xzfile.c line 410:

    strm.avail_out = sizeof discard;

avail_out should be limited so that the loop won't discard more
than discard_bytes bytes.
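One way to do that is to request at most the number of bytes still to
be skipped. A sketch (the function name and DISCARD_MAX, standing in
for sizeof discard, are my inventions; total_out and discard_bytes
mirror the variables in xzfile.c):

```c
#include <stdint.h>

#define DISCARD_MAX (8192 * 10)  /* stand-in for sizeof discard */

/* How much output to request on this iteration so the loop never
 * discards past discard_bytes. Caller guarantees
 * total_out <= discard_bytes. */
static uint64_t
discard_avail_out(uint64_t total_out, uint64_t discard_bytes)
{
    uint64_t remaining = discard_bytes - total_out;
    return remaining < DISCARD_MAX ? remaining : DISCARD_MAX;
}
```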

In many places the code uses read(fd,buf,size) calls with the assumption
that a positive return value less than "size" must mean end of file or
an error. In general read() may return less than "size" without hitting
end of file or error. I don't know if Linux makes extra guarantees
over POSIX when reading from a regular file, but even if it does, I
still wouldn't rely on it.
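The usual remedy is a small wrapper that loops until the requested
amount has arrived, end of file is hit, or a real error occurs. A
sketch (xread is my name for it, not something in xzfile.c):

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Read exactly size bytes unless end of file or an error cuts it
 * short. Returns the number of bytes read (possibly less than size
 * only at end of file), or -1 on error with errno set by read(). */
static ssize_t
xread(int fd, void *buf, size_t size)
{
    size_t done = 0;
    while (done < size) {
        ssize_t n = read(fd, (char *)buf + done, size - done);
        if (n == 0)
            break;              /* end of file */
        if (n == -1) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;          /* real error */
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

With this, a return value less than size really does mean end of file,
so the callers' assumption becomes valid.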

After those small things I think it should have a good chance of
working once you add code to decompress the requested part of the block
in xzfile_pread(). While that is still some work, don't get discouraged
now: you have the messiest parts mostly done already. Obviously what
you are doing should have been abstracted into a nice file I/O library
long ago, but so far no such library exists.

> is this stuff documented anywhere?

The documentation is poor. The API headers have reference-like docs,
but so far there are only example programs for the most basic
compression and decompression, so there are no examples of random
access. (I don't count list.c in the xz sources as an example program.)

The liblzma APIs for random access are low level and thus require
a lot of code to use. One also needs to understand the structure of
the .xz file format. One reason for such low-level APIs is that
liblzma takes its input and gives its output via buffers provided by
the application; callback functions or file I/O functions aren't used.

My idea was and still is to have a separate file I/O library that would
handle not only .xz files but also uncompressed, .gz, and .bz2 files.
There is some old pre-pre-alpha code in libxzfile.git on
git.tukaani.org, but in its current state it's not interesting since
it's so incomplete and there's almost no compression-related code yet.

It is a bit backwards that right now, compared to XZ Utils, XZ for Java
has much cleaner code, better docs, and an *easy*-to-use random-access
decompressor class. On the other hand XZ for Java works on streams
instead of passing *both* input and output via caller-provided buffers.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode
