[ https://issues.apache.org/jira/browse/FLINK-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16980729#comment-16980729 ]
Piotr Nowojski edited comment on FLINK-14845 at 11/23/19 10:15 AM:
-------------------------------------------------------------------
Before going further, I have a couple of questions:
* Where are you going to compress the data into? Some unpooled memory?
* Would you keep a reference count on the buffer holding the uncompressed bytes for as long as we are handling the compressed memory? If not, compression could double the amount of buffered in-flight data. If yes, we could double the memory usage for the same amount of buffered in-flight data (this issue might affect pipelined subpartitions differently than blocking ones).
* What compression algorithm would you like to use? Some library? Will we need to add a new dependency? Will its license be compatible with Apache?
* Will the compression be configurable? Can the user select among multiple compression algorithms and compression parameters? Do we want to let the user provide their own compression, in other words, make the compressor pluggable?
* Also, what is the timeline for this feature? Would you like it to make it into 1.10?
> Introduce data compression to blocking shuffle.
> -----------------------------------------------
>
>                 Key: FLINK-14845
>                 URL: https://issues.apache.org/jira/browse/FLINK-14845
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Network
>            Reporter: Yingjie Cao
>            Assignee: Yingjie Cao
>            Priority: Major
>
> Currently, the blocking shuffle writer writes raw output data to disk without compression. For IO-bound scenarios, this can be optimized by compressing the output data. It would be better to introduce a compression mechanism and offer a config option that lets the user decide whether to compress the shuffle data. We have actually implemented compression in our internal Flink version, and here are some key points:
> 1. Where to compress/decompress? Compress at the upstream side and decompress at the downstream side.
> 2. Which thread does the compression/decompression? The task threads.
> 3. Data compression granularity: per buffer.
> 4. How to handle the case where the data size becomes even bigger after compression? Give up compression in this case and introduce an extra flag to identify whether the data was compressed, that is, the output may be a mixture of compressed and uncompressed data.
>
> We'd like to contribute blocking shuffle data compression to Flink if there is interest.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
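Key point 4 above (falling back to raw bytes, plus a flag marking whether a buffer was compressed, whenever compression does not shrink the buffer) can be sketched roughly as follows. This is a minimal illustration using the JDK's `java.util.zip.Deflater`/`Inflater`, not Flink's actual implementation: the class name `BufferCompressorSketch`, the one-byte flag layout, and the `originalSize` argument are all hypothetical choices made for this sketch.

```java
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical per-buffer compression with an uncompressed fallback,
// mirroring point 4 of the issue description. Illustrative only.
public class BufferCompressorSketch {
    static final byte COMPRESSED = 1;
    static final byte UNCOMPRESSED = 0;

    // Compress one buffer; if the result is not strictly smaller than the
    // input, keep the raw bytes. The first byte of the result is the flag.
    static byte[] compress(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        byte[] out = new byte[data.length]; // only accept results smaller than the input
        int len = deflater.deflate(out);
        boolean shrunk = deflater.finished() && len < data.length;
        deflater.end();
        byte[] framed;
        if (shrunk) {
            framed = new byte[len + 1];
            framed[0] = COMPRESSED;
            System.arraycopy(out, 0, framed, 1, len);
        } else {
            framed = new byte[data.length + 1];
            framed[0] = UNCOMPRESSED;
            System.arraycopy(data, 0, framed, 1, data.length);
        }
        return framed;
    }

    // Decompress a framed buffer; originalSize bounds the output (in Flink,
    // network buffers have a fixed maximum size, so such a bound is known).
    static byte[] decompress(byte[] framed, int originalSize) throws Exception {
        if (framed[0] == UNCOMPRESSED) {
            return Arrays.copyOfRange(framed, 1, framed.length);
        }
        Inflater inflater = new Inflater();
        inflater.setInput(framed, 1, framed.length - 1);
        byte[] out = new byte[originalSize];
        int len = inflater.inflate(out);
        inflater.end();
        return Arrays.copyOf(out, len);
    }

    public static void main(String[] args) throws Exception {
        // Highly compressible data round-trips through the compressed path.
        byte[] zeros = new byte[4096];
        if (!Arrays.equals(zeros, decompress(compress(zeros), zeros.length)))
            throw new AssertionError("round trip failed");

        // Small random data likely grows under deflate, triggering the fallback;
        // the round trip must still succeed either way.
        byte[] random = new byte[64];
        new java.util.Random(42).nextBytes(random);
        if (!Arrays.equals(random, decompress(compress(random), random.length)))
            throw new AssertionError("fallback round trip failed");
    }
}
```

Because the output may be a mixture of compressed and uncompressed buffers, the reader only ever inspects the flag byte; no global "is this file compressed" state is needed, which matches the per-buffer granularity of point 3.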