On 2013-06-23 Richard W.M. Jones wrote:
> I'm trying to write an NBD driver for XZ files. This requires random
> access to the files.
>
> So far I've loaded the index from the file, and I'm using
> lzma_index_iter_locate (successfully) to locate the block and
> uncompressed offset that contains the byte of interest. However I'm
> stuck as to where I go from there.
>
> I am able to decode the block header using lzma_block_header_decode.
> But should I need to do that? Isn't the block already "loaded" in the
> index?
The index contains only the sizes of the blocks, so yes, you need to
decode the block header. In the liblzma API it is separate from the
Block decompressor code. The block header contains information needed
to decode the compressed data in the block. The block header may also
contain the same size information as the index, so it's recommended to
check that the index and the block header don't contradict each other.

To the question "XXX Any reason we are doing this?" in xzfile.c: If
the compressed size was stored in the block header, calling
lzma_block_compressed_size() validates that it matches the unpadded
size in the index. If the compressed size wasn't stored in the block
header, block.compressed_size is set from iter.block.unpadded_size.
In both cases the block decoder will then check that the compressed
size really is what the headers say it should be. If
lzma_block_compressed_size() fails, something is wrong with the block
header or the index.

I noticed that the equivalent check for block.uncompressed_size was
missing from list.c. I have added it now:

    http://git.tukaani.org/?p=xz.git;a=commitdiff;h=ebb501ec73cecc546c67117dd01b5e33c90bfb4a

Anyway, the point of these checks is just to be pedantic and detect
corrupt files as well as possible. They aren't strictly required.

> I'm also able to read the data from the block (although decoding fails
> at the end of the block -- I don't understand why).

There is a hardcoded value:

    block.check = LZMA_CHECK_NONE;

The value should be taken from iter.stream.flags->check instead. But
to get it from there, it needs to be stored first with
lzma_index_stream_flags(). Look for the call to that function in
list.c. With a hardcoded LZMA_CHECK_NONE you get an error if there is
an integrity check, because the observed block size doesn't match what
has been stored in the block header (or in the index).

Since you appear to want to support .xz files with more than one
stream (that is good), you also need to use
lzma_index_stream_padding().
Again, see list.c. If there is padding but its size isn't stored in
the index, seeking will fail because the compressed offsets will be
miscalculated.

A few things I noticed while reading the code:

xzfile.c line 171:

    index_size = footer_flags.backward_size;

index_size is off_t. If someone for some weird reason builds the code
with 32-bit off_t, there is an integer overflow if backward_size
(i.e. the size of the index) doesn't fit into 32 bits. It is very
unlikely in practice, but the theoretical max value for backward_size
is 2^34, i.e. LZMA_BACKWARD_SIZE_MAX.

xzfile.c line 196:

    strm.avail_in = index_size;
    if (strm.avail_in > BUFSIZ)
      strm.avail_in = BUFSIZ;

If index_size is greater than UINT32_MAX (unlikely) and the code runs
on a 32-bit system where size_t is 32 bits, the above causes trouble
if the lowest 32 bits are all zero (even more unlikely but still
possible).

xzfile.c line 211:

    if (r != LZMA_STREAM_END) {
      nbdkit_error ("%s: could not parse index (error %d)",
                    filename, r);

The above should check that the size of the index matches what was
stored in the stream footer. If there's a mismatch, the file is
corrupt. It is good to catch it here so that it doesn't cause trouble
later. Here is a possible fix:

    if (r != LZMA_STREAM_END || index_size != 0 || strm.avail_in != 0) {

xzfile.c lines 389-391 and 410-412:

    while (strm.total_out < discard_bytes) {
      uint8_t buf[BUFSIZ];
      uint8_t discard[BUFSIZ * 10];
      ...
      strm.avail_out = sizeof discard;
      strm.next_out = discard;
      r = lzma_code (&strm, LZMA_RUN);

Maybe this loop isn't even meant to be finished yet, but just in case:
the above makes the incorrect assumption that lzma_code() will always
consume the whole input buffer. The library is free to buffer as much
data as it wants, and thus it's possible that it sometimes provides a
lot of output while consuming little or even no input.

xzfile.c line 410:

    strm.avail_out = sizeof discard;

avail_out should be limited so that the loop won't discard more than
discard_bytes bytes.
In many places the code uses read(fd, buf, size) calls with the
assumption that a positive return value less than "size" must mean end
of file or an error. In general read() may return less than "size"
without hitting end of file or an error. I don't know if Linux makes
extra guarantees over POSIX when reading from a regular file, but even
if it does, I still wouldn't rely on it.

After those small things I think it should have a good chance to work
once you add code to decompress the requested part of the block into
xzfile_pread(). While that is still some work, don't get discouraged
now: you have the messiest parts mostly done already. Obviously what
you are doing should have been abstracted into a nice file I/O library
long ago, but so far that doesn't exist.

> is this stuff documented anywhere?

The documentation is poor. The API headers have reference-like docs,
but so far there are only example programs for the most basic
compression and decompression, so there are no examples about random
access. (I don't count list.c in the xz sources as an example
program.)

The liblzma APIs for random access are low level and thus require a
lot of code to use. One also needs to understand the structure of the
.xz file format. A reason for such low-level APIs is that liblzma
takes its input and gives its output via buffers provided by the
application. Callback functions or file I/O functions aren't used.

My idea was and still is to have a separate file I/O library that
would handle not only .xz files but also uncompressed, .gz, and .bz2
files. There is some old pre-pre-alpha code in libxzfile.git on
git.tukaani.org, but in its current state it's not interesting since
it's so incomplete and there's almost no compression-related code yet.

It is a bit backwards that right now, compared to XZ Utils, XZ for
Java has much cleaner code, better docs, and an *easy*-to-use
random-access decompressor class.
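The usual fix for the read() assumption is a small wrapper that loops
until it has the requested number of bytes, end of file, or a real
error. The name full_read is hypothetical, not something in xzfile.c;
a minimal sketch:

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Sketch: read exactly `size` bytes unless EOF or an error occurs.
 * Returns the number of bytes read (short only at EOF) or -1 on
 * error, so the caller can distinguish the three cases. */
static ssize_t
full_read (int fd, void *buf, size_t size)
{
  size_t done = 0;

  while (done < size) {
    ssize_t n = read (fd, (char *) buf + done, size - done);
    if (n == -1) {
      if (errno == EINTR)
        continue;               /* interrupted by a signal: retry */
      return -1;                /* real error */
    }
    if (n == 0)
      break;                    /* end of file */
    done += (size_t) n;
  }

  return (ssize_t) done;
}
```

With this, a return value less than "size" really does mean end of
file, and -1 really does mean an error.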
On the other hand, XZ for Java works on streams instead of passing
*both* input and output via caller-provided buffers.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode