On 2011-08-04 Stefan Bodewig wrote:
> On 2011-08-03, Lasse Collin wrote:
> > I looked at the APIs and code in Commons Compress to see how XZ
> > support could be added. I was especially looking for details where
> > one would need to be careful to make different compressors behave
> > consistently compared to each other.
> 
> This is in large part due to the history of Commons Compress which
> combined several different codebases with separate APIs and provided a
> first attempt to layer a unifying API on top of it.  We are aware of
> quite a few problems and want to address them in Commons Compress 2.x
> and it would be really great if you would participate in the design of
> the new APIs once that discussion kicks off.

I'm not sure how much I can help, but I can try (depending on how much
time I have).

> > (2) BZip2CompressorOutputStream.flush() calls out.flush() but it
> >     doesn't flush data buffered by BZip2CompressorOutputStream.
> >     Thus not all data written to the Bzip2 stream will be available
> >     in the underlying output stream after flushing. This kind of
> >     flush() implementation doesn't seem very useful.
> 
> Agreed, do you want to open a JIRA issue for this?

There is already this:

    https://issues.apache.org/jira/browse/COMPRESS-42
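
The problem is easy to see with a quick test like this (just a sketch;
only BZip2CompressorOutputStream is the real Commons Compress class):

    import java.io.ByteArrayOutputStream;

    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;

    public class FlushDemo {
        public static void main(String[] args) throws Exception {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            BZip2CompressorOutputStream bz2
                    = new BZip2CompressorOutputStream(sink);

            bz2.write("hello, world".getBytes("US-ASCII"));
            bz2.flush();

            // If flush() pushed everything through, sink would now
            // hold a decodable prefix containing "hello, world".
            // It doesn't, because the block buffer isn't flushed.
            System.out.println("bytes in sink after flush(): "
                    + sink.size());

            bz2.close();
        }
    }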

I tried to understand how flushing could be done properly. I'm not
really familiar with bzip2 so the following might have errors.

I checked libbzip2 to see how its BZ_FLUSH works. It finishes the
current block, but it doesn't flush the last bits, and thus the
complete block isn't available in the output stream. The blocks in the
.bz2 format aren't aligned to full bytes, and there is no padding
between blocks.

The lack of alignment makes flushing tricky. To end the output on a
byte boundary, one may need to write out up to seven bits of data from
the future. The bright side is that those future bits can only come
from the block header magic or from the end of stream magic. Both are
constants, so there are only two possibilities for what those seven
bits can be.
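
To be concrete, the block header magic is the 48-bit constant
0x314159265359 and the end of stream magic is 0x177245385090, so the
borrowed bits must be the top seven bits of one of them:

    public class Bz2MagicBits {
        public static void main(String[] args) {
            long blockMagic = 0x314159265359L; // BCD digits of pi
            long eosMagic   = 0x177245385090L; // BCD digits of sqrt(pi)

            // Top seven bits of each 48-bit magic: the only two
            // possible values for the borrowed bits.
            System.out.println(bits7(blockMagic >>> 41)); // 0011000
            System.out.println(bits7(eosMagic   >>> 41)); // 0001011
        }

        private static String bits7(long v) {
            return String.format("%7s",
                    Long.toBinaryString(v)).replace(' ', '0');
        }
    }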

Using bits from the end of stream magic doesn't make sense, because
then one would be forced to finish the stream. Using the bits from the
block header magic means that one must add at least one more block.
This is fine if the application is going to encode at least one more
byte. If the application calls close() right after flushing, then
there's a problem unless the .bz2 format allows empty blocks. I get a
feeling from the code that .bz2 would support empty blocks, but I'm
not sure at all.

Since bzip2 works on blocks that are compressed independently of each
other, the compression ratio doesn't take a big penalty if the stream
is finished and a new stream is started. This would make it much
simpler to implement flushing. The downside is that implementations
that don't support decoding concatenated .bz2 files will stop after
the first stream.
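
A rough sketch of that approach (the wrapper class is made up;
finish() is the existing method that ends the current stream without
closing the underlying one):

    import java.io.IOException;
    import java.io.OutputStream;

    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;

    // flush() finishes the current .bz2 stream and starts a new one,
    // so the output becomes a concatenation of complete .bz2 streams.
    public class RestartingBZip2OutputStream extends OutputStream {
        private final OutputStream out;
        private BZip2CompressorOutputStream bz2;

        public RestartingBZip2OutputStream(OutputStream out)
                throws IOException {
            this.out = out;
            this.bz2 = new BZip2CompressorOutputStream(out);
        }

        @Override
        public void write(int b) throws IOException {
            bz2.write(b);
        }

        @Override
        public void write(byte[] buf, int off, int len)
                throws IOException {
            bz2.write(buf, off, len);
        }

        @Override
        public void flush() throws IOException {
            bz2.finish(); // ends the stream, flushing all buffered bits
            out.flush();
            bz2 = new BZip2CompressorOutputStream(out);
        }

        @Override
        public void close() throws IOException {
            bz2.close(); // finishes the last stream and closes out
        }
    }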

> > (4) The decompressor streams don't support concatenated .gz and .bz2
> >     files. This can be OK when compressed data is used inside
> >     another file format or protocol, but with regular
> >     (standalone) .gz and .bz2 files it is bad to stop after the
> >     first compressed stream and silently ignore the remaining
> >     compressed data.
> 
> >     Fixing this in BZip2CompressorInputStream should be relatively
> >     easy because it stops right after the last byte of the
> >     compressed stream.
> 
> Is this <https://issues.apache.org/jira/browse/COMPRESS-146>?

Yes. I didn't check the suggested fix though.
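
Until that is fixed, applications can loop over the streams
themselves. A sketch that relies on the decoder stopping right after
the last byte of each stream:

    import java.io.BufferedInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

    public class DecodeConcatenated {
        public static void decodeAll(InputStream rawIn, OutputStream dst)
                throws IOException {
            // mark/reset is used to probe for more data between streams.
            BufferedInputStream in = new BufferedInputStream(rawIn);
            byte[] buf = new byte[8192];

            while (true) {
                in.mark(1);
                if (in.read() == -1)
                    break; // real end of file
                in.reset();

                BZip2CompressorInputStream bz2
                        = new BZip2CompressorInputStream(in);
                int n;
                while ((n = bz2.read(buf)) != -1)
                    dst.write(buf, 0, n);
                // Don't close bz2 here; it would close `in` as well.
            }
        }
    }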

> >     Fixing GzipCompressorInputStream is harder because the problem
> >     is inherited from java.util.zip.GZIPInputStream which reads
> >     input past the end of the first stream. One might need to
> >     reimplement .gz container support on top of
> >     java.util.zip.InflaterInputStream or java.util.zip.Inflater.
> 
> Sounds doable but would need somebody to code it, I guess ;-)

There is a slightly hackish solution in the comments of the following
bug report, but it lacks license information:

    http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4691425
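
The core trick in such a reimplementation would be to run the Inflater
in raw mode and use getRemaining() to push back whatever was read past
the end of the deflate data. A sketch of just that part (gzip header
and trailer parsing, CRC checks etc. omitted; the PushbackInputStream
needs a pushback buffer at least as large as inBuf):

    import java.io.IOException;
    import java.io.PushbackInputStream;
    import java.util.zip.DataFormatException;
    import java.util.zip.Inflater;

    public class RawInflate {
        static void inflateOneMember(PushbackInputStream in)
                throws IOException, DataFormatException {
            Inflater inf = new Inflater(true); // true = raw deflate data
            byte[] inBuf = new byte[4096];
            byte[] outBuf = new byte[4096];
            int inLen = 0;

            while (!inf.finished()) {
                if (inf.needsInput()) {
                    inLen = in.read(inBuf);
                    if (inLen == -1)
                        throw new IOException("Truncated deflate stream");
                    inf.setInput(inBuf, 0, inLen);
                }
                int produced = inf.inflate(outBuf);
                // ... pass outBuf[0..produced) to the caller ...
            }

            // Put back what the Inflater over-read so that the gzip
            // trailer and a possible next member can be parsed
            // normally.
            int unused = inf.getRemaining();
            in.unread(inBuf, inLen - unused, unused);
            inf.end();
        }
    }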

> In the past we have incorporated external codebases (ar and cpio) that
> used to be under compatible licenses to make things simpler for our
> users, but if you prefer to develop your code base outside of Commons
> Compress then I can fully understand that.

I will develop it in my own tree, but it's possible to include a copy
in Commons Compress with modified "package" and "import" lines in the
source files. Changes in my tree would need to be copied to Commons
Compress now and then. I don't know if this is better than having an
external dependency.

org.tukaani.xz will include features that aren't necessarily
interesting for Commons Compress, for example advanced compression
options and random access reading. Most developers probably won't care
about these.

(The above answers Simone Tripodi's message too.)

> From the dependency management POV I know many
> developers prefer dependencies that are available from a Maven
> repository, is this the case for the org.tukaani.xz package (I'm too
> lazy to check).

There is only a build.xml for Ant.

> Also I would have a problem with an external dependency on code that
> says "The APIs aren't completely stable yet".  Any tentative timeframe
> as to when you expect to have a stable API?  It might match our
> schedule for 2.x so we could target that release rather than 1.3.

It needs to be stable in 2-4 weeks or so. I need to get feedback about
the API first; I think I will get some next week. More people giving
feedback would naturally be welcome. ;-)

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode
