Re: [Haskell-cafe] Using Data.Binary for compression

2007-11-15 Thread David Roundy
On Thu, Nov 15, 2007 at 11:10:01AM -0800, Chad Scherrer wrote:
> > > Almost all 'real users' just use Codec.Compression.GZip.  It's very
> > > fast, very compositional, and (perhaps surprisingly) almost as
> > > effective as application-specific schemes.
> >
> > I was about to say the same thing. It's so much simpler to use Duncan's
> > carefully written zlib binding.
> 
> I have several types of symbols, and for each type the probabilities
> are very predictable - to the point where they could even be
> hard-coded. And upon completion I can be sure the first two questions
> will be "Can we make it smaller?" and "Can we make it faster?". GZip
> (while very cool) is adaptive and general-purpose, so it's building
> frequency tables as it goes and ignoring the structure of the data I
> should be able to take advantage of.
> 
> With an awful lot of trouble, it must be possible to write something
> in C that goes faster and yields better compression than gzip for this
> particular data. With the probability structure known in advance,
> there are just a lot of steps taken by gzip that are no longer needed.
> Besides this, gzip only assumes an arbitrary sequence of bytes, but my
> data are much more structured than this.

The catch is that gzip can beat even ideal arithmetic compression, if there
happen to be correlations between symbols.  So what you claim is correct
only if there are no correlations other than those taken into account in
your known probability structure.  Any chance you can tell us what this
mysterious data is?

But bit stream operations (and data compression) are seriously cool in any
case, so I hope you'll go ahead with this!
-- 
David Roundy
Department of Physics
Oregon State University


Re: [Haskell-cafe] Using Data.Binary for compression

2007-11-15 Thread Chad Scherrer
> > Almost all 'real users' just use Codec.Compression.GZip.  It's very
> > fast, very compositional, and (perhaps surprisingly) almost as effective
> > as application-specific schemes.
>
> I was about to say the same thing. It's so much simpler to use Duncan's
> carefully written zlib binding:
>
> import Data.Binary
> import Codec.Compression.GZip
> import qualified Data.ByteString.Lazy as L
>
> main = L.writeFile "log.gz" . compress . encode $ [1..10::Int]
>
> Simple, purely functional, fast.
>
> -- Don

I have several types of symbols, and for each type the probabilities
are very predictable - to the point where they could even be
hard-coded. And upon completion I can be sure the first two questions
will be "Can we make it smaller?" and "Can we make it faster?". GZip
(while very cool) is adaptive and general-purpose, so it's building
frequency tables as it goes and ignoring the structure of the data I
should be able to take advantage of.

With an awful lot of trouble, it must be possible to write something
in C that goes faster and yields better compression than gzip for this
particular data. With the probability structure known in advance,
there are just a lot of steps taken by gzip that are no longer needed.
Besides this, gzip only assumes an arbitrary sequence of bytes, but my
data are much more structured than this.

Considering the high performance achieved using idiomatic Haskell and
the ByteString and Binary libraries, I would think a similar approach
could be used for writing individual bits. Then it would be
(relatively) easy to write custom compression routines in Haskell with
reasonable performance - I don't think this can be said of any other
language.

Chad


Re: [Haskell-cafe] Using Data.Binary for compression

2007-11-15 Thread David Roundy
On Wed, Nov 14, 2007 at 10:03:52PM -0800, Chad Scherrer wrote:
> I'd like to be able to use Data.Binary (or similar) for compression.
> Say I have an abstract type Symbol, and for each value of Symbol I
> have a representation in terms of some number of bits. For compression
> to be efficient, commonly-used Symbols should have very short
> representations, while less common ones can be longer.

I agree with others that it's probably not worth your effort to do
compression yourself except for fun, but it *is* fun, so interpret my
advice below in that light.  (Also, bitwise operations could be useful for
other things, like interacting with standard formats, e.g. writing IEEE
doubles portably, something that Data.Binary doesn't do.)
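
For instance, here is a rough sketch of one portable workaround -- not the
raw IEEE bit layout, but serializing a Double via decodeFloat/encodeFloat
from the Prelude.  The function names here are made up, not part of
Data.Binary:

import Data.Binary (Put, Get, put, get)

-- Serialize as (significand :: Integer, exponent :: Int), both of
-- which already have portable Binary instances.
putPortableDouble :: Double -> Put
putPortableDouble d = put (decodeFloat d)

getPortableDouble :: Get Double
getPortableDouble = do
  (m, e) <- get
  return (encodeFloat m e)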

...

> (2) This seems like it will work ok, but the feel is not as clean as
> the current Data.Binary interface. Is there something I'm missing that
> might make it easier to integrate this?

I would write this as a monad, analogous to PutM.  Make bit monads GetBits
and PutBitsM (with PutBits = PutBitsM ()), and then write functions like

writeBits :: PutBits -> Put
readBits :: GetBits a -> Get a

where writeBits and readBits would pad their reading/writing to the next
byte boundary (or perhaps Word32 boundary, for better performance?) as they
must.  So this would have two main uses: users could use it to serialize
data (e.g. writing an IEEE Double serialization), or to write their own
data compression, putting *all* the serialization at the bit level.
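
To make the shape of this concrete, here is a minimal sketch of the Put
side, written as a state monad layered over Data.Binary's PutM (the names
are hypothetical, this is not an existing library, and the Get side would
be analogous):

import Data.Binary.Put (Put, PutM, putWord8)
import Data.Bits (setBit)
import Data.Word (Word8)

-- State: the partially filled byte and how many of its high bits
-- are already in use (0..7).
newtype PutBitsM a =
  PutBitsM { runPutBitsM :: (Word8, Int) -> PutM (a, (Word8, Int)) }

type PutBits = PutBitsM ()

instance Functor PutBitsM where
  fmap f (PutBitsM m) = PutBitsM $ \s -> do
    (a, s') <- m s
    return (f a, s')

instance Applicative PutBitsM where
  pure a = PutBitsM $ \s -> return (a, s)
  PutBitsM mf <*> PutBitsM ma = PutBitsM $ \s -> do
    (f, s')  <- mf s
    (a, s'') <- ma s'
    return (f a, s'')

instance Monad PutBitsM where
  PutBitsM m >>= k = PutBitsM $ \s -> do
    (a, s') <- m s
    runPutBitsM (k a) s'

-- Write one bit, most significant bit first, flushing each completed
-- byte to the underlying Put stream.
putBit :: Bool -> PutBits
putBit b = PutBitsM $ \(w, n) ->
  let w' = if b then setBit w (7 - n) else w
  in if n == 7
       then putWord8 w' >> return ((), (0, 0))
       else return ((), (w', n + 1))

-- Run a bit-level writer, zero-padding the last partial byte so the
-- output ends on a byte boundary.
writeBits :: PutBits -> Put
writeBits (PutBitsM m) = do
  (_, (w, n)) <- m (0, 0)
  if n > 0 then putWord8 w else return ()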

This approach, of course, also would allow you to copy much of the
infrastructure of Binary into your new bit-level interface.

> (3) Right now this is just proof of concept, but eventually I'd like
> to do some performance tuning, and it would be nice to have a
> representation that's amenable to this. Any thoughts on speeding this
> up while keeping the interface reasonably clean would be much
> appreciated.

I think a monad as above would have the advantage of separating the
implementation from the interface, which should make it tuneable.
-- 
David Roundy
Department of Physics
Oregon State University


Re: [Haskell-cafe] Using Data.Binary for compression

2007-11-15 Thread Bulat Ziganshin
Hello Chad,

Thursday, November 15, 2007, 9:03:52 AM, you wrote:

> I'd like to be able to use Data.Binary (or similar) for compression.
> Say I have an abstract type Symbol, and for each value of Symbol I
> have a representation in terms of some number of bits. For compression
> to be efficient, commonly-used Symbols should have very short
> representations, while less common ones can be longer.

an alternative may be to use a naive representation for serializing and
then run a zip/bzip2 compression library over it. that should be both
faster and give better compression


-- 
Best regards,
 Bulat                          mailto:[EMAIL PROTECTED]



Re: [Haskell-cafe] Using Data.Binary for compression

2007-11-14 Thread Don Stewart
stefanor:
> On Wed, Nov 14, 2007 at 10:03:52PM -0800, Chad Scherrer wrote:
> > Hi,
> > 
> > I'd like to be able to use Data.Binary (or similar) for compression.
> > Say I have an abstract type Symbol, and for each value of Symbol I
> > have a representation in terms of some number of bits. For compression
> > to be efficient, commonly-used Symbols should have very short
> > representations, while less common ones can be longer.
> ...
> > (1) Am I reinventing the wheel? I haven't seen anything like this, but
> > it would be nice to be a bit more certain.
> > 
> > (2) This seems like it will work ok, but the feel is not as clean as
> > the current Data.Binary interface. Is there something I'm missing that
> > might make it easier to integrate this?
> > 
> > (3) Right now this is just proof of concept, but eventually I'd like
> > to do some performance tuning, and it would be nice to have a
> > representation that's amenable to this. Any thoughts on speeding this
> > up while keeping the interface reasonably clean would be much
> > appreciated.
> 
> Almost all 'real users' just use Codec.Compression.GZip.  It's very
> fast, very compositional, and (perhaps surprisingly) almost as effective
> as application-specific schemes.

I was about to say the same thing. It's so much simpler to use Duncan's
carefully written zlib binding:

import Data.Binary
import Codec.Compression.GZip
import qualified Data.ByteString.Lazy as L

main = L.writeFile "log.gz" . compress . encode $ [1..10::Int]

Simple, purely functional, fast.
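
And the matching read-back is just as short (a sketch, assuming the file
was written by the snippet above):

import Data.Binary
import Codec.Compression.GZip
import qualified Data.ByteString.Lazy as L

main = do
  bs <- L.readFile "log.gz"
  print (decode (decompress bs) :: [Int])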

-- Don


Re: [Haskell-cafe] Using Data.Binary for compression

2007-11-14 Thread Stefan O'Rear
On Wed, Nov 14, 2007 at 10:03:52PM -0800, Chad Scherrer wrote:
> Hi,
> 
> I'd like to be able to use Data.Binary (or similar) for compression.
> Say I have an abstract type Symbol, and for each value of Symbol I
> have a representation in terms of some number of bits. For compression
> to be efficient, commonly-used Symbols should have very short
> representations, while less common ones can be longer.
...
> (1) Am I reinventing the wheel? I haven't seen anything like this, but
> it would be nice to be a bit more certain.
> 
> (2) This seems like it will work ok, but the feel is not as clean as
> the current Data.Binary interface. Is there something I'm missing that
> might make it easier to integrate this?
> 
> (3) Right now this is just proof of concept, but eventually I'd like
> to do some performance tuning, and it would be nice to have a
> representation that's amenable to this. Any thoughts on speeding this
> up while keeping the interface reasonably clean would be much
> appreciated.

Almost all 'real users' just use Codec.Compression.GZip.  It's very
fast, very compositional, and (perhaps surprisingly) almost as effective
as application-specific schemes.

Stefan




[Haskell-cafe] Using Data.Binary for compression

2007-11-14 Thread Chad Scherrer
Hi,

I'd like to be able to use Data.Binary (or similar) for compression.
Say I have an abstract type Symbol, and for each value of Symbol I
have a representation in terms of some number of bits. For compression
to be efficient, commonly-used Symbols should have very short
representations, while less common ones can be longer.

Since an encoding like [Bool] would be really inefficient for this (at
least I think it would, though some fancy fusion tricks might be able
to help), I was thinking a reasonable approach might be to use Word8
(for example), and then specify a number of bits n, indicating that
only the first n bits are to be written to the compressed
representation.

I was looking at the internals of Data.Binary, and it seems that
PutM could be used for this purpose (was something like this its
original purpose?). Today, I put this together:

import Data.Binary.Put (PutM, putWord8)
import Data.Bits ((.|.), shiftL, shiftR)
import Data.Word (Word8)

type BitRep = Word8   -- bits live in the high end of the byte
type NBits = Int      -- how many of those bits are meaningful

type MyBits = (BitRep, NBits)

-- Append the n bits of b after the m bits of a, emitting a byte
-- whenever one fills up, and return the leftover bits.
(#) :: MyBits -> MyBits -> PutM MyBits
(a, m) # (b, n) = case (a .|. (b `shiftR` m), m + n) of
  ans@(ab, s) -> if s < 8
    then return ans
    else putWord8 ab >> return (b `shiftL` (8 - m), s - 8)

From there, it would be easy enough to map [Symbol] -> [MyBits] and then
use something like foldM (#) to get into the PutM monad; a sketch follows.
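
Concretely, building on the definitions above (putMyBits is just a
made-up name; the final flush is the one step (#) can't do by itself):

import Control.Monad (foldM)

putMyBits :: [MyBits] -> PutM ()
putMyBits xs = do
  (w, n) <- foldM (#) (0, 0) xs
  if n > 0 then putWord8 w else return ()  -- pad and flush the last byte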

A couple of questions:

(1) Am I reinventing the wheel? I haven't seen anything like this, but
it would be nice to be a bit more certain.

(2) This seems like it will work ok, but the feel is not as clean as
the current Data.Binary interface. Is there something I'm missing that
might make it easier to integrate this?

(3) Right now this is just proof of concept, but eventually I'd like
to do some performance tuning, and it would be nice to have a
representation that's amenable to this. Any thoughts on speeding this
up while keeping the interface reasonably clean would be much
appreciated.

Thanks!

-Chad