This is an astute observation. The key constraint is that we can't send half a compressed message set, because most compression formats require processing the compressed data block from the beginning in order to decompress it. How we handle that depends on the version of Kafka.
In 0.7, messages inside a compressed message set effectively have no offset. The only offsets that can be committed are the boundaries of these sets, which is not ideal since you can get duplicates within a message set.

In 0.8 we have fixed this. Each message has an offset, and it is valid to request any offset from the server. If you request an offset that falls in the middle of a compressed message set, the server will send you the full message set (otherwise you couldn't decompress it), but a smart client can and should discard the messages prior to its fetch offset. The Java and Scala clients do this automatically.

Cheers,

-Jay

On Mon, Nov 19, 2012 at 12:45 PM, David Arthur <mum...@gmail.com> wrote:
> I'm working on offset management for my Python client (non-ZK). I'm having
> trouble seeing how you would keep track of the message offset when using
> compression. As I understand it, when you use compression, you concatenate
> many messages together and then compress the resulting encoded MessageSet.
> How could you possibly keep track of the message offsets when doing this?
> As best as I can figure, you can only determine commit offsets for
> top-level messages - not nested ones (as you get with compression).
>
> Thanks!
> -David
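P.S. To make the skipping concrete, since you're writing a Python client: a rough sketch of the consume-side logic might look like the following. parse_message_set is a stand-in for whatever routine your client already uses to turn message-set bytes into (offset, value) pairs, and I only handle gzip here; none of these names come from the protocol itself.

    import gzip

    def iter_messages(fetch_offset, fetched_messages, parse_message_set):
        """Yield (offset, value) pairs, skipping anything before fetch_offset.

        fetched_messages: iterable of (offset, codec, value) tuples from your
        fetch-response parser. parse_message_set: your routine that turns the
        decompressed bytes back into (offset, value) pairs. Both are
        placeholders for whatever your client already has.
        """
        for offset, codec, value in fetched_messages:
            if codec == 0:
                # Plain message: use it if it is at or past the requested offset.
                if offset >= fetch_offset:
                    yield offset, value
            else:
                # Compressed wrapper (assuming gzip here): the broker returns the
                # whole compressed set, so decompress it and drop the leading
                # messages that fall before the offset you actually asked for.
                for inner_offset, inner_value in parse_message_set(gzip.decompress(value)):
                    if inner_offset >= fetch_offset:
                        yield inner_offset, inner_value

The only point that matters is the inner_offset >= fetch_offset check; everything else is just whatever parsing your client already does.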