On Tue, Feb 10, 2009 at 01:36:21PM -0800, Scott David Daniels wrote:
> > A simple way to fix this would be to add a finished attribute to the
> > Decompress object.
> Perhaps you could submit a patch with such a change?
Yes, I will try to get to that this week.

> > However, perhaps this would be a good time to discuss how this library
> > works; it is somewhat awkward and perhaps there are other changes which
> > would make it cleaner.
> Well, it might be improvable, I haven't really looked. I personally
> would like it and bz2 to get closer to each other in interface, rather
> than to spread out. So if you are really opening up a can of worms,
> I vote for two cans.

Well, I like this idea; perhaps this is a good time to discuss the
equivalent of some "abstract base classes", or "interfaces", for
compression.

As I see it, the fundamental abstractions are the stream-oriented
de/compression routines. Given those, one should easily be able to
implement one-shot de/compression of strings. In fact, that is the way
zlib itself is implemented; the base functions are the stream-oriented
ones, with a layer of convenience functions on top that do one-shot
compression and decompression.

After examining the bz2 module, I notice that it has a file-like
interface called BZ2File, which is roughly analogous to the gzip
module. That file interface could form a third API, and basically
conform to what python expects of files.

So what I suggest is a common framework of three APIs: a sequential
compression/decompression API for streams; a layer (potentially
generic) on top of that for strings/buffers; and a third API for
file-like access. Presumably the file-like access can be implemented
on top of the sequential API as well.

If the sequential de/compression routines are indeed primitive, and
sufficient for implementing the other two APIs, then we have the
option of writing the two "upper" layers in pure python, reducing the
amount of extension code that has to be written. I see that as
desirable, since it gives us options for those upper layers: implement
them in pure python, or bind to C code where the underlying library
provides an equivalent.
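To make the layering argument concrete: the one-shot string API can be
a thin pure-python wrapper over the sequential compressobj/decompressobj
interface that zlib already exposes. A minimal sketch (the function
names compress_oneshot/decompress_oneshot are hypothetical, not part of
any module):

```python
import zlib

def compress_oneshot(data, level=6):
    # One-shot compression built on the sequential (stream) API:
    # feed all the data at once, then flush the remaining output.
    c = zlib.compressobj(level)
    return c.compress(data) + c.flush()

def decompress_oneshot(data):
    # Same idea for decompression.
    d = zlib.decompressobj()
    return d.decompress(data) + d.flush()

original = b"hello, hello, hello " * 100
assert decompress_oneshot(compress_oneshot(original)) == original
```

If every compression module exposed the same sequential primitives,
this wrapper could be written once, generically, for all of them.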
I seem to recall a number of ancillary functions in zlib, such as those
for loading a compression dictionary. There are also options such as
flushing the compressor's state in order to be able to resynchronize
should part of the archive become garbled. Where these functions are
available, they could be implemented, though it would be desirable to
give them the same name in each module so that client code can test
for their existence in a compression-agnostic way.

For what it's worth, I would rather see a pythonic interface to the
libraries than a simple-as-can-be wrapper around the C functions. I
personally find it annoying to have to drop down to non-OOP styles in
a python program in order to use a C library. It doesn't matter to me
whether the OOP layer is added atop the C library in pure python or in
the C-to-python binding; that is an implementation detail to me, and I
suspect to most python programmers. They don't care; they just want it
to be easy to use from python. If performance turns out to matter, and
the underlying compression library supports an "upper layer" in C,
then we have the option of using that code.

So my suggestion is that we (the python users) brainstorm on how we
want the API to look, and not focus on the underlying library except
insofar as it informs our discussion of the proper APIs - for example,
features such as flushing state, setting compression levels/windows,
or resynchronization points.

My further suggestion is that we start with the sequential
de/compression API, since it seems like the fundamental primitive.
De/compressing strings will then be trivial, and the file-like
interface is already described by Python.

So my first suggestion on the stream de/compression API thread is:
the sequential de/compression API needs to be capable of returning
more than just the de/compressed data. It should at least be capable
of returning end-of-stream conditions, and possibly other states as
well.
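As an illustration of why end-of-stream reporting matters: with the
current zlib objects, the decompressor only betrays end-of-stream
indirectly, through unused_data, and then only when trailing bytes
happen to follow the stream. A small sketch of the status quo (the
trailing bytes here are contrived for illustration):

```python
import zlib

# A complete zlib stream with unrelated bytes appended after it.
payload = zlib.compress(b"payload") + b"trailing bytes"

d = zlib.decompressobj()
out = d.decompress(payload)

# Decompression stops at the end of the zlib stream; everything after
# it lands in unused_data. If there were no trailing bytes, client
# code would have no direct attribute saying "the stream has ended" -
# which is exactly what a finished/eof attribute would provide.
assert out == b"payload"
assert d.unused_data == b"trailing bytes"
```

A state attribute like this is also the kind of feature client code
could probe for with hasattr() in a compression-agnostic way.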
I see a few ways of implementing this:

1) The de/compression object holds state in various members, such as
data input buffers, data output buffers, and a state field indicating
conditions such as synchronization points or end-of-stream. Member
functions are called and primarily manipulate the data members of the
object.

2) The de/compression object has routines for reading de/compressed
data, and signals states such as end-of-stream or resynchronization
points as exceptions, much as the file class can throw EOFError. My
problem with this is that client code has to be cognizant of the
possible exceptions that might be thrown, so one cannot easily add new
exceptions should the need arise. For example, if we add an exception
to indicate a possible resynchronization point, existing client code
may not be capable of handling it as a non-fatal exception.

Thoughts?
--
Crypto ergo sum. http://www.subspacefield.org/~travis/
Do unto other faiths as you would have them do unto yours.
If you are a spammer, please email j...@subspacefield.org to get blacklisted.
--
http://mail.python.org/mailman/listinfo/python-list