[
https://issues.apache.org/jira/browse/KAFKA-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906407#comment-13906407
]
Jay Kreps commented on KAFKA-1253:
----------------------------------
This will be tricky but is possible.
Here are a couple pointers:
1. ByteBuffer.array() will give the backing array for heap ByteBuffers (check
hasArray() first; direct buffers have no accessible array) so we can work with
apis that only accept arrays.
2. GZIPOutputStream requires a stream. Two options:
a. Make an OutputStream implementation based on ByteBuffer.
ByteArrayOutputStream would work but it will be tricky because you would have
to do new ByteArrayOutputStream(size), then get at its backing array (the
protected buf field; toByteArray() returns a copy) and use ByteBuffer.wrap()
on that array to create the ByteBuffer.
b. Directly use the Deflater compression code java provides, which is what
GZIPOutputStream uses under the covers. This is a better api but there are
some subtle differences that GZIPOutputStream hacks around (it adds the gzip
header and the CRC32 trailer) and we would have to do similar hacking.
3. There are two snappy libraries: we currently use the JNI wrapper for the
google native code, but there is also a pure java impl. Ideally either way
snappy should not be a runtime dependency unless you enable snappy compression.
This will mean not instantiating the classes in the snappy jar unless they are
needed.
4. The desired end result here is that our performance on compressed messages
is comparable to the underlying compression codec and not artificially limited
by lots and lots of byte copying (e.g. see
http://grokbase.com/t/kafka/users/1383bcfkym/compression-performance). For
example snappy claims performance on the order of hundreds of MB/sec. So it
would be good to make a stand-alone main method that runs the message
compression to create compressed messages and benchmarks the performance, as
well as look at it in hprof to ensure the time is actually going to compression.
This performance will be particularly important on the server side where we
need to both decompress and recompress and where compression is a big
bottleneck.
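
To make option 2a and the stand-alone benchmark in point 4 a bit more concrete,
here is a rough sketch, not the actual producer code. The names
ByteBufferGzipSketch, ByteBufferOutputStream, gzip and gunzip are made up for
illustration, and it assumes heap buffers only. GZIPOutputStream writes straight
into a growable ByteBuffer, so there is no toByteArray() copy at the end.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ByteBufferGzipSketch {

    /** Hypothetical OutputStream that appends into a growable heap ByteBuffer. */
    static final class ByteBufferOutputStream extends OutputStream {
        private ByteBuffer buffer;

        ByteBufferOutputStream(int initialCapacity) {
            this.buffer = ByteBuffer.allocate(initialCapacity);
        }

        @Override
        public void write(int b) {
            ensureRemaining(1);
            buffer.put((byte) b);
        }

        @Override
        public void write(byte[] src, int off, int len) {
            ensureRemaining(len);
            buffer.put(src, off, len);
        }

        /** Grows the buffer (one copy per growth) when it cannot hold `needed` more bytes. */
        private void ensureRemaining(int needed) {
            if (buffer.remaining() >= needed) return;
            int newCapacity = Math.max(buffer.capacity() * 2, buffer.position() + needed);
            ByteBuffer bigger = ByteBuffer.allocate(newCapacity);
            buffer.flip();
            bigger.put(buffer);
            buffer = bigger;
        }

        /** Read-ready view of the bytes written so far; shares the backing array. */
        ByteBuffer toByteBuffer() {
            ByteBuffer view = buffer.duplicate();
            view.flip();
            return view;
        }
    }

    /** Compresses payload with gzip directly into a ByteBuffer. */
    static ByteBuffer gzip(byte[] payload) throws IOException {
        ByteBufferOutputStream out = new ByteBufferOutputStream(64);
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(payload);
        }
        return out.toByteBuffer();
    }

    /** Decompresses a heap ByteBuffer produced by gzip(); used here as a round-trip check. */
    static byte[] gunzip(ByteBuffer compressed) throws IOException {
        // array() only works for heap buffers, so check hasArray() first.
        if (!compressed.hasArray()) {
            throw new IllegalArgumentException("sketch only handles heap buffers");
        }
        ByteArrayInputStream in = new ByteArrayInputStream(
                compressed.array(),
                compressed.arrayOffset() + compressed.position(),
                compressed.remaining());
        ByteBufferOutputStream out = new ByteBufferOutputStream(compressed.remaining() * 2);
        byte[] chunk = new byte[4096];
        try (GZIPInputStream gz = new GZIPInputStream(in)) {
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        ByteBuffer result = out.toByteBuffer();
        byte[] bytes = new byte[result.remaining()];
        result.get(bytes);
        return bytes;
    }

    /** Crude throughput loop; not a rigorous benchmark (no warm-up, single payload). */
    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[1 << 20];
        for (int i = 0; i < payload.length; i++) {
            payload[i] = (byte) ('a' + (i % 16)); // repetitive, so it compresses well
        }
        int rounds = 20;
        long compressedBytes = 0;
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            compressedBytes += gzip(payload).remaining();
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        double megabytes = rounds * payload.length / (1024.0 * 1024.0);
        System.out.printf("%.1f MB/sec, %d compressed bytes per round%n",
                megabytes / seconds, compressedBytes / rounds);
    }
}
```

The growth step still copies when the buffer fills up, so sizing the initial
buffer close to the expected compressed size matters. A real benchmark would
also want warm-up rounds and realistic message payloads before trusting the
numbers or the hprof output.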
> Implement compression in new producer
> -------------------------------------
>
> Key: KAFKA-1253
> URL: https://issues.apache.org/jira/browse/KAFKA-1253
> Project: Kafka
> Issue Type: Sub-task
> Components: producer
> Reporter: Jay Kreps
>