[ 
https://issues.apache.org/jira/browse/HDFS-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144909#comment-13144909
 ] 

jinglong.liujl commented on HDFS-2542:
--------------------------------------

We have implemented a prototype that covers the four to-do items listed in 
the description, using QuickLZ as the compression codec.

During compression, we used the settings below and collected the following statistics:
dfs.block.compress.chunk.size   1M
dfs.block.compress.ratio.min    1.2  
block.compressor.thread.num     1

Compressed blocks: 132744, size before 8008031402428, size after 2910562972347, 
compression ratio 2.75136.
Total blocks: 300060, size before 14813811957857, size after 9716343527776, 
compression ratio 1.52462.
Compressed blocks / total blocks: 0.44239 
Space saved by compression: 4.63612 T
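
For readers who want to check the arithmetic, the figures above can be recomputed with a short sketch. It assumes the sizes are in bytes and that "T" means 2^40 bytes; neither is stated in the comment, but both are consistent with the reported values. The variable names are illustrative only.

```python
# Figures copied from the statistics above.
# Assumption: sizes are in bytes (not stated explicitly in the comment).
compressed_blocks = 132744
total_blocks = 300060
compressed_before = 8008031402428
compressed_after = 2910562972347
total_before = 14813811957857
total_after = 9716343527776

# Compression ratio = size before / size after.
block_ratio = compressed_before / compressed_after
overall_ratio = total_before / total_after

# Fraction of blocks that were actually compressed.
block_fraction = compressed_blocks / total_blocks

# Space saved, converted with 1 T = 2**40 bytes (assumption).
saved_bytes = total_before - total_after
saved_t = saved_bytes / 2**40

print(f"compressed-block ratio: {block_ratio:.5f}")
print(f"overall ratio:          {overall_ratio:.5f}")
print(f"compressed fraction:    {block_fraction:.5f}")
print(f"space saved (T):        {saved_t:.5f}")
```

The space saved on the compressed blocks alone (size before minus size after) equals the space saved over all blocks, as expected when the remaining blocks are stored unchanged.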


                
> Transparent compression storage in HDFS
> ---------------------------------------
>
>                 Key: HDFS-2542
>                 URL: https://issues.apache.org/jira/browse/HDFS-2542
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: jinglong.liujl
>
> As in HDFS-2115, we want to provide a mechanism to improve storage usage in 
> HDFS through compression. Unlike HDFS-2115, this issue focuses on compressed 
> storage. Some ideas are listed below:
> To do:
> 1. Compress cold data.
>    Cold data: after being written (or last read), the data has not been 
> touched by anyone for a long time.
>    Hot data: after being written, many clients read it, and it may be 
> deleted soon.
>    
>    Because compressing hot data is not cost-effective, we only compress cold 
> data. 
>    In some cases, part of a file is accessed at high frequency while other 
> parts of the same file are cold. 
> To distinguish them, we compress at the block level.
> 2. Compress data that has a high compression ratio.
>    To tell whether the ratio is high or low, we try compressing the data; if 
> the compression ratio is too low, we leave the data uncompressed.
> 2. Forward compatibility.
>     After compression, the data format on the datanode changes, and old 
> clients cannot read it. To solve this, we provide a mechanism that 
> decompresses on the datanode.
> 3. Support random access and append.
>    As in HDFS-2115, random access can be supported by an index. Before 
> compression, we split the data into fixed-length pieces (we call each piece 
> a "chunk"), and every chunk has its own index entry.
> For random access, we seek to the nearest index entry and read that chunk to 
> reach the precise position.
> 4. Compress asynchronously so that compression does not slow down running 
> jobs.
>    In practice, we found that CPU usage is not uniform across clusters. Some 
> clusters are idle at night, and others are idle in the afternoon. The 
> compression task should run at full speed when the cluster is idle and at 
> low speed when the cluster is busy.
> Will do:
> 1. Client-specified codec and support for compressed transmission.
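
The chunked layout and the minimum-ratio check described in the to-do items can be illustrated with a minimal sketch. This is not the prototype's code: it uses Python's `zlib` in place of QuickLZ, and the function names, the in-memory index layout, and the default values (taken from the settings `dfs.block.compress.chunk.size` and `dfs.block.compress.ratio.min` quoted in the comment) are assumptions made for illustration.

```python
import zlib

CHUNK_SIZE = 1024 * 1024  # assumed to mirror dfs.block.compress.chunk.size = 1M
MIN_RATIO = 1.2           # assumed to mirror dfs.block.compress.ratio.min = 1.2


def compress_block(data, chunk_size=CHUNK_SIZE, min_ratio=MIN_RATIO):
    """Compress data in fixed-length chunks (hypothetical helper).

    Returns (payload, index), where index[i] is the byte offset of
    compressed chunk i inside payload and the last entry is len(payload).
    Returns None when the achieved ratio is below min_ratio, meaning the
    block should stay uncompressed.
    """
    parts = []
    index = [0]
    for start in range(0, len(data), chunk_size):
        part = zlib.compress(data[start:start + chunk_size])
        parts.append(part)
        index.append(index[-1] + len(part))
    payload = b"".join(parts)
    if not payload or len(data) / len(payload) < min_ratio:
        return None
    return payload, index


def read_range(payload, index, offset, length, chunk_size=CHUNK_SIZE):
    """Read `length` bytes of the original data starting at `offset`.

    Only the chunks overlapping the requested range are decompressed.
    """
    out = bytearray()
    end = offset + length
    chunk_no = offset // chunk_size  # chunk that contains `offset`
    while len(out) < length and chunk_no < len(index) - 1:
        chunk = zlib.decompress(payload[index[chunk_no]:index[chunk_no + 1]])
        chunk_start = chunk_no * chunk_size
        lo = max(offset - chunk_start, 0)
        hi = min(end - chunk_start, len(chunk))
        out += chunk[lo:hi]
        chunk_no += 1
    return bytes(out)
```

Because chunks have a fixed uncompressed length, the chunk holding a given offset is found by integer division, and the index only needs to record where each compressed chunk starts. A smaller chunk size makes random reads cheaper but usually lowers the compression ratio.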

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
