[ 
https://issues.apache.org/jira/browse/KAFKA-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Kreps updated KAFKA-595:
----------------------------

         Labels: feature  (was: feature features)
    Description: 
In 0.7, Kafka always appended messages to the log using whatever compression 
codec the client used. In 0.8, after the KAFKA-506 patch, the master always 
recompresses the message to assign ids before appending it to the log. Currently 
the server uses an ad hoc heuristic to choose a compression codec based on the 
codecs the producer used, which doesn't make much sense. It would be better for 
the server to have its own compression setting (a global default and a 
per-topic override) that specifies the codec, and have the server always 
recompress with this codec regardless of the original codec.

Compression currently happens in kafka.log.Log.assignOffsets (which should 
perhaps be renamed if it takes on compression as an official responsibility 
rather than a side effect).
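
To make the proposal concrete, here is a minimal Java sketch of how the server 
might resolve the codec for a topic. Everything in it is hypothetical: the 
class BrokerCompressionConfig, the method codecFor, the Codec enum and the 
topic names are for illustration only and are not existing Kafka code or 
config keys.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names exist in Kafka. */
public class BrokerCompressionConfig {

    /** Codec names chosen for illustration only. */
    public enum Codec { NONE, GZIP, SNAPPY }

    private final Codec defaultCodec;                // global default
    private final Map<String, Codec> topicOverrides; // per-topic override

    public BrokerCompressionConfig(Codec defaultCodec, Map<String, Codec> topicOverrides) {
        this.defaultCodec = defaultCodec;
        // Defensive copy so later changes to the caller's map have no effect.
        this.topicOverrides = new HashMap<>(topicOverrides);
    }

    /**
     * Returns the codec the server would recompress with for this topic.
     * The producer's codec is a parameter only to show that it is ignored.
     */
    public Codec codecFor(String topic, Codec producerCodec) {
        return topicOverrides.getOrDefault(topic, defaultCodec);
    }

    public static void main(String[] args) {
        Map<String, Codec> overrides = new HashMap<>();
        overrides.put("clicks", Codec.SNAPPY);
        BrokerCompressionConfig config =
            new BrokerCompressionConfig(Codec.GZIP, overrides);

        // Topic with an override: the producer's codec does not matter.
        System.out.println(config.codecFor("clicks", Codec.NONE));  // SNAPPY
        // Topic without an override falls back to the global default.
        System.out.println(config.codecFor("logs", Codec.SNAPPY));  // GZIP
    }
}
```

The point of the sketch is only that the lookup depends on the topic and the 
server's own settings, never on the codec the producer chose.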

  was:
Compression can be used to store something in less space (less I/O) and/or 
transfer it less expensively (better use of network bandwidth). Often the two 
go hand in hand, such as when compressed data is written to a disk: the disk 
I/O takes less time, since fewer bits are being transferred, and the data 
occupies less storage on the disk after the transfer. Unfortunately, the time 
to compress the data can exceed the savings gained from transferring less data, 
resulting in overall degradation.

After KAFKA-506, the network usage gains we used to get by compressing data at 
the producers are exceeded by the cost of decompressing and re-compressing data 
at the server side. Compression to save on network costs must be done either to 
reduce the contention in a wide-area network due to multiple point-to-point 
connections OR to efficiently transfer data over low-bandwidth networks 
(cross-DC). In the case of producer-server connections, neither is typically 
true, which means we might not benefit from producer side compression at all in 
most production deployments of Kafka. On the contrary, it might actually hurt 
performance since most production deployments turn on compression for all 
topics.

The main benefit of compressing data in Kafka is to efficiently transfer data 
cross-DC for setting up mirrored Kafka clusters. The performance benefit also 
holds for real time consumers, especially when there are multiple groups of 
consumers consuming the same topic. If data is compressed on the server side 
instead, which we do anyway, we can get the I/O savings as well as efficient 
network transfer on the server-consumer links.

I don't have numbers to quantify the performance impact of re-compression now, 
since there are other changes that need to be done to test this correctly.

Thoughts?

        Summary: Decouple producer side compression from server-side 
compression.  (was: Producer side compression is unnecessary)
    
> Decouple producer side compression from server-side compression.
> ----------------------------------------------------------------
>
>                 Key: KAFKA-595
>                 URL: https://issues.apache.org/jira/browse/KAFKA-595
>             Project: Kafka
>          Issue Type: Improvement
>    Affects Versions: 0.8
>            Reporter: Neha Narkhede
>              Labels: feature
>
> In 0.7, Kafka always appended messages to the log using whatever compression 
> codec the client used. In 0.8, after the KAFKA-506 patch, the master always 
> recompresses the message to assign ids before appending it to the log. 
> Currently the server uses an ad hoc heuristic to choose a compression codec 
> based on the codecs the producer used, which doesn't make much sense. It 
> would be better for the server to have its own compression setting (a global 
> default and a per-topic override) that specifies the codec, and have the 
> server always recompress with this codec regardless of the original codec.
> Compression currently happens in kafka.log.Log.assignOffsets (which should 
> perhaps be renamed if it takes on compression as an official responsibility 
> rather than a side effect).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
