[ https://issues.apache.org/jira/browse/FLINK-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979002#comment-16979002 ]
Yingjie Cao commented on FLINK-14845:
-------------------------------------

[~pnowojski], thanks for proposing these options.

1. According to [~lzljs3620320]'s comment, there seems to be a lot of work to do if we add compression to the table/sql layer.
2. Currently, my POC implementation adopts this option, though as you have pointed out, it could make the system less complete.
3. As far as I know, most compression algorithms act on a window of data, for example 32k or 64k. If we want to implement a continuous compression algorithm which takes in records and gives out compressed records immediately (not producing the compressed record immediately may increase latency and invalidate the flush mechanism), then we must consider how to update the metadata (the 'dict' for decompression) and let the downstream side know how to decompress. As you have pointed out, that is complicated.
4. If we let the compression happen inside BufferConsumer#build, then for Pipelined mode the netty threads do the compression, and for Blocking mode the task thread does the compression. If we do not care which thread does the compression (not a problem for me), it is a good choice and can work for both Pipelined and Blocking mode.

From my point of view, I prefer the fourth one. What do you think, [~pnowojski]?

> Introduce data compression to blocking shuffle.
> -----------------------------------------------
>
>                 Key: FLINK-14845
>                 URL: https://issues.apache.org/jira/browse/FLINK-14845
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Network
>            Reporter: Yingjie Cao
>            Assignee: Yingjie Cao
>            Priority: Major
>
> Currently, the blocking shuffle writer writes raw output data to disk without compression. For IO-bound scenarios, this can be optimized by compressing the output data. It is better to introduce a compression mechanism and offer a config option to let users decide whether to compress the shuffle data. We have already implemented compression in our internal Flink version, and here are the key points:
> 1. Where to compress/decompress?
> Compress at the upstream and decompress at the downstream.
> 2. Which threads do the compression/decompression?
> Task threads do the compression/decompression.
> 3. Data compression granularity.
> Per buffer.
> 4. How to handle the case where the data size becomes even bigger after compression?
> Give up compression in this case and introduce an extra flag to identify whether the data was compressed, that is, the output may be a mixture of compressed and uncompressed data.
>
> We'd like to introduce blocking shuffle data compression to Flink if there is interest.
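
To make key points 3 and 4 of the quoted description concrete, here is a minimal sketch of per-buffer compression with a fallback to the raw bytes when compression does not shrink the buffer. It uses java.util.zip purely for illustration; the one-byte flag layout, the class name and the method names are assumptions of this sketch, not the actual implementation discussed in the ticket (which would operate on network buffers rather than plain byte arrays).

{code:java}
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/**
 * Minimal sketch of per-buffer compression with a fallback to the raw bytes
 * (key points 3 and 4 above). The single-byte header flag, the use of
 * java.util.zip and all names here are illustrative assumptions only.
 */
public final class BufferCompressionSketch {

    private static final byte UNCOMPRESSED = 0;
    private static final byte COMPRESSED = 1;

    /** Compress one buffer; give up and keep the raw bytes if compression does not shrink it. */
    public static byte[] compress(byte[] buffer) {
        Deflater deflater = new Deflater();
        deflater.setInput(buffer);
        deflater.finish();

        // Try to compress into an output array of the same size as the input.
        byte[] out = new byte[buffer.length];
        int compressedLength = deflater.deflate(out);
        // Compression only "helped" if the whole input fit and got smaller.
        boolean helped = deflater.finished() && compressedLength < buffer.length;
        deflater.end();

        byte[] payload = helped ? Arrays.copyOf(out, compressedLength) : buffer;
        byte[] result = new byte[payload.length + 1];
        result[0] = helped ? COMPRESSED : UNCOMPRESSED; // per-buffer flag
        System.arraycopy(payload, 0, result, 1, payload.length);
        return result;
    }

    /** Decompress at the downstream side, honoring the per-buffer flag. */
    public static byte[] decompress(byte[] data) throws DataFormatException {
        byte[] payload = Arrays.copyOfRange(data, 1, data.length);
        if (data[0] == UNCOMPRESSED) {
            return payload; // this buffer was stored uncompressed
        }
        Inflater inflater = new Inflater();
        inflater.setInput(payload);
        ByteArrayOutputStream restored = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!inflater.finished()) {
            int n = inflater.inflate(chunk);
            if (n == 0 && inflater.needsInput()) {
                break; // truncated input; stop instead of looping forever
            }
            restored.write(chunk, 0, n);
        }
        inflater.end();
        return restored.toByteArray();
    }
}
{code}

The per-buffer flag is what allows the written output to be a mixture of compressed and uncompressed data: the downstream reader checks the flag before deciding whether to run the decompressor, so an incompressible buffer costs only one extra byte.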