Hi Lasse and welcome!

On 2011-08-03, Lasse Collin wrote:
> I have been working on an XZ data compression implementation in Java
> <http://tukaani.org/xz/java.html>. I was told that it could be nice
> to get XZ support into Commons Compress.

Sounds interesting.

> I looked at the APIs and code in Commons Compress to see how XZ
> support could be added. I was especially looking for details where
> one would need to be careful to make different compressors behave
> consistently compared to each other.

This is in large part due to the history of Commons Compress, which
combined several different codebases with separate APIs and provided a
first attempt to layer a unifying API on top of them. We are aware of
quite a few problems and want to address them in Commons Compress 2.x,
and it would be really great if you would participate in the design of
the new APIs once that discussion kicks off.

Right now I myself am pretty busy implementing ZIP64 support for a 1.3
release of Commons Compress and intend to start the 2.x discussion once
this is done - which is (combined with some scheduled offline time)
about a month away for me.

I should probably also mention that right now probably no active
committer understands the bzip2 code well enough to make significant
changes at all. I know that I don't.

> I found a few possible problems in the existing code:
>
> (1) CompressorOutputStream should have finish(). Now
>     BZip2CompressorOutputStream has finish() but
>     GzipCompressorOutputStream doesn't. This should be easy to
>     fix because java.util.zip.GZIPOutputStream supports finish().

+1 - this is a good point we should earmark for 2.0; doing so in 1.x
would break the API, which we try to avoid. (There's a rough sketch of
what I'd expect the gzip side to look like further down.)

> (2) BZip2CompressorOutputStream.flush() calls out.flush() but it
>     doesn't flush data buffered by BZip2CompressorOutputStream.
>     Thus not all data written to the bzip2 stream will be available
>     in the underlying output stream after flushing. This kind of
>     flush() implementation doesn't seem very useful.

Agreed, do you want to open a JIRA issue for this?

> GzipCompressorOutputStream.flush() is the default version
> from OutputStream and thus does nothing. Adding flush()
> into GzipCompressorOutputStream is hard because
> java.util.zip.GZIPOutputStream and java.util.zip.Deflater don't
> support sync flushing before Java 7. To get gzip flushing in
> older Java versions one might need a complete reimplementation
> of the Deflate algorithm, which isn't necessarily practical.

Not really desirable, I agree. As for Java 7: we currently target Java
5, but it might be possible to hack in flush support using reflection,
so we could support sync flushing whenever the running Java class
library supports it.
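To make (1) a bit more concrete, something along these lines is what I
have in mind for the gzip side - just a sketch written from memory and
not compiled, and the assumption that the class wraps a
java.util.zip.GZIPOutputStream in a field may not match the real code:

import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

import org.apache.commons.compress.compressors.CompressorOutputStream;

/**
 * Sketch only: assumes the class delegates to a wrapped
 * java.util.zip.GZIPOutputStream; the real implementation may be
 * structured differently.
 */
public class GzipCompressorOutputStream extends CompressorOutputStream {

    private final GZIPOutputStream out;

    public GzipCompressorOutputStream(OutputStream outputStream)
            throws IOException {
        out = new GZIPOutputStream(outputStream);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
    }

    /**
     * Finishes the compressed stream without closing the underlying
     * stream - simply delegates to GZIPOutputStream#finish().
     */
    public void finish() throws IOException {
        out.finish();
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}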
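And regarding the reflection idea for (2): what I have in mind is
detecting the Deflater#deflate(byte[], int, int, int) overload and the
SYNC_FLUSH constant that Java 7 adds, and only offering a real flush()
when they are present. Completely untested, the helper name is made up,
and it only covers the reflective call - the flushed bytes would of
course still have to make their way to the underlying stream:

import java.lang.reflect.Method;
import java.util.zip.Deflater;

/**
 * Sketch: call the Java 7 Deflater#deflate(byte[], int, int, int)
 * overload with SYNC_FLUSH via reflection, so the code still compiles
 * and runs on Java 5/6.
 */
final class SyncFlushHelper {

    private static final Method DEFLATE_WITH_FLUSH;
    private static final int SYNC_FLUSH_VALUE;

    static {
        Method m = null;
        int syncFlush = 0;
        try {
            m = Deflater.class.getMethod("deflate", byte[].class,
                                         int.class, int.class, int.class);
            syncFlush = Deflater.class.getField("SYNC_FLUSH").getInt(null);
        } catch (Exception ex) {
            // running on Java 5 or 6 - sync flush not available
            m = null;
        }
        DEFLATE_WITH_FLUSH = m;
        SYNC_FLUSH_VALUE = syncFlush;
    }

    static boolean isSyncFlushSupported() {
        return DEFLATE_WITH_FLUSH != null;
    }

    /**
     * Deflates with SYNC_FLUSH and returns the number of bytes written
     * to buf, or -1 if the running JRE doesn't support sync flushing.
     */
    static int syncFlush(Deflater deflater, byte[] buf) throws Exception {
        if (DEFLATE_WITH_FLUSH == null) {
            return -1;
        }
        return (Integer) DEFLATE_WITH_FLUSH.invoke(deflater, buf,
                Integer.valueOf(0), Integer.valueOf(buf.length),
                Integer.valueOf(SYNC_FLUSH_VALUE));
    }
}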
> (3) BZip2CompressorOutputStream has finalize() that finishes a stream
>     that hasn't been explicitly finished or closed. This doesn't seem
>     useful. GzipCompressorOutputStream doesn't have an equivalent
>     finalize().

Removing it could cause backwards compatibility issues. I agree it is
unnecessary but would leave fixing it to the point where we are willing
to break compatibility - i.e. 2.0. This is in the same category as
<https://issues.apache.org/jira/browse/COMPRESS-128> to me.

> (4) The decompressor streams don't support concatenated .gz and .bz2
>     files. This can be OK when compressed data is used inside another
>     file format or protocol, but with regular (standalone) .gz and
>     .bz2 files it is bad to stop after the first compressed stream
>     and silently ignore the remaining compressed data.
>
>     Fixing this in BZip2CompressorInputStream should be relatively
>     easy because it stops right after the last byte of the compressed
>     stream.

Is this <https://issues.apache.org/jira/browse/COMPRESS-146>?

>     Fixing GzipCompressorInputStream is harder because the problem is
>     inherited from java.util.zip.GZIPInputStream which reads input
>     past the end of the first stream. One might need to reimplement
>     .gz container support on top of java.util.zip.InflaterInputStream
>     or java.util.zip.Inflater.

Sounds doable, but would need somebody to code it, I guess ;-)

> The XZ compressor supports finish() and flush(). The XZ decompressor
> supports concatenated .xz files, but there is also a single-stream
> version that behaves similarly to the current version of
> BZip2CompressorInputStream.

I think in the 1.x timeframe users that know they are using XZ would
simply bypass the Commons Compress interfaces, like they'd do now if
they wanted to flush the bzip2 stream. The main difference here likely
is that they wouldn't need to use Commons Compress at all but could be
using your XZ package directly in that case (see the P.S. below for
roughly what I mean). They don't have that choice with bzip2.

> Assuming that there will be some interest in adding XZ support into
> Commons Compress, is it OK to make Commons Compress depend on the XZ
> package org.tukaani.xz, or should the XZ code be modified so that
> it could be included as an internal part of Commons Compress?
>
> I would prefer depending on org.tukaani.xz because then there is just
> one code base to keep up to date.

In the past we have incorporated external codebases (ar and cpio) that
used to be under compatible licenses to make things simpler for our
users, but if you prefer to develop your code base outside of Commons
Compress then I can fully understand that.

From a license POV we obviously wouldn't have any problems with your
public domain code. From the dependency management POV I know many
developers prefer dependencies that are available from a Maven
repository - is this the case for the org.tukaani.xz package? (I'm too
lazy to check.) I'm an Ant person myself, but you know there are those
people who love repositories ...

Also, I would have a problem with an external dependency on code that
says "The APIs aren't completely stable yet". Do you have any tentative
timeframe for when you expect to have a stable API? It might match our
schedule for 2.x, so we could target that release rather than 1.3.

Cheers

        Stefan
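P.S. For anybody following along, here is roughly what using the
org.tukaani.xz package directly would look like, as far as I can tell
from the javadocs - untested and written from a quick skim, so treat it
as a sketch rather than a recommendation (file names are obviously just
for illustration):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.tukaani.xz.LZMA2Options;
import org.tukaani.xz.XZInputStream;
import org.tukaani.xz.XZOutputStream;

public class XZRoundTrip {
    public static void main(String[] args) throws Exception {
        // compress test.txt to test.txt.xz using the default LZMA2 options
        InputStream in = new FileInputStream("test.txt");
        OutputStream fileOut = new FileOutputStream("test.txt.xz");
        XZOutputStream xzOut = new XZOutputStream(fileOut, new LZMA2Options());
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            xzOut.write(buf, 0, n);
        }
        xzOut.finish();   // the finish() support discussed above
        xzOut.close();
        in.close();

        // decompress again - XZInputStream is the variant that handles
        // concatenated .xz streams, while SingleXZInputStream would stop
        // after the first stream
        InputStream xzIn = new XZInputStream(new FileInputStream("test.txt.xz"));
        OutputStream out = new FileOutputStream("test.txt.copy");
        while ((n = xzIn.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        xzIn.close();
        out.close();
    }
}

If I read the docs correctly, the XZInputStream/SingleXZInputStream
split matches the concatenation behaviour you describe for .xz above.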