[ 
https://issues.apache.org/jira/browse/HDFS-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006648#comment-16006648
 ] 

Hari Sekhon commented on HDFS-2542:
-----------------------------------

I recall looking for this feature 2-3 years ago while at a large financial 
institution, and it has just come up again with another large financial client 
I'm working for right now.

I see I already upvoted this JIRA the last time I looked at it, but there has 
been no movement on it in years.

Having transparent compression on a directory tree would be a very useful 
feature.

Is there any chance of this being implemented?

> Transparent compression storage in HDFS
> ---------------------------------------
>
>                 Key: HDFS-2542
>                 URL: https://issues.apache.org/jira/browse/HDFS-2542
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: jinglong.liujl
>         Attachments: tranparent compress storage.docx
>
>
> As in HDFS-2115, we want to provide a mechanism to improve storage usage in 
> HDFS through compression. Unlike HDFS-2115, this issue focuses on compressed 
> storage. Some ideas are listed below:
> To do:
> 1. Compress cold data.
>    Cold data: after being written (or last read), the data has not been 
> touched by anyone for a long time.
>    Hot data: after being written, many clients will read it, and it may be 
> deleted soon.
>    
>    Because compressing hot data is not cost-effective, we only compress cold 
> data. 
>    In some cases, part of a file is accessed frequently while other parts of 
> the same file are cold. 
> To distinguish them, we compress at the block level.
> 2. Compress data that has a high compression ratio.
>    To classify data as having a high or low compression ratio, we try to 
> compress it; if the compression ratio is too low, we never compress it.
> 3. Forward compatibility.
>    After compression, the data format on the datanode has changed, so old 
> clients cannot read it. To solve this, we provide a mechanism that 
> decompresses on the datanode.
> 4. Support random access and append.
>    As in HDFS-2115, random access can be supported by an index. Before 
> compressing, we split the data into fixed-length pieces (which we call 
> "chunks"), and every chunk has its own index entry.
> For random access, we seek to the nearest index entry and read that chunk to 
> reach the precise position.
> 5. Asynchronous compression, so that compression does not slow down running 
> jobs.
>    In practice, we found that cluster CPU usage is not uniform. Some clusters 
> are idle at night, and others are idle in the afternoon. The compression task 
> should run at full speed when the cluster is idle and at low speed when the 
> cluster is busy.
> Will do:
> 1. Client-specific codec and support for compressed transmission.
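
For readers skimming the quoted description, the fixed-length chunk index from 
the "random access and append" item and the trial compression from the 
"high compress ratio" item can be sketched roughly as below. This is only an 
illustration, not anything from HDFS or from the attached design document: the 
function names, the 64 KiB chunk size, the 20% saving threshold, and the use of 
zlib are all assumptions made for the example.

```python
import zlib

CHUNK_SIZE = 64 * 1024  # hypothetical fixed chunk length, in uncompressed bytes


def worth_compressing(data, min_saving=0.2):
    """Trial-compress the data; True only if it shrinks by at least min_saving.

    min_saving is an assumed threshold, not a value from the issue.
    """
    if not data:
        return False
    return len(zlib.compress(data)) <= len(data) * (1 - min_saving)


def compress_block(data, chunk_size=CHUNK_SIZE):
    """Compress data chunk by chunk and build an index.

    index[i] is the offset in the payload where compressed chunk i starts;
    the final entry is the total payload length.
    """
    parts = []
    index = [0]
    for start in range(0, len(data), chunk_size):
        comp = zlib.compress(data[start:start + chunk_size])
        parts.append(comp)
        index.append(index[-1] + len(comp))
    return b"".join(parts), index


def read_range(payload, index, offset, length, chunk_size=CHUNK_SIZE):
    """Return up to `length` uncompressed bytes starting at `offset`.

    Only the chunks that overlap the requested range are decompressed.
    """
    out = bytearray()
    end = offset + length
    chunk_no = offset // chunk_size  # chunks are fixed-length, so seek is a division
    while offset < end and chunk_no < len(index) - 1:
        chunk = zlib.decompress(payload[index[chunk_no]:index[chunk_no + 1]])
        base = chunk_no * chunk_size
        lo = offset - base
        hi = min(len(chunk), end - base)
        if lo >= len(chunk):
            break  # offset lies past the end of the data
        out += chunk[lo:hi]
        offset = base + hi
        chunk_no += 1
    return bytes(out)
```

A reader would call compress_block once per block, store the index next to the 
payload, and then serve a positioned read with read_range(payload, index, 
offset, length). Append is not covered here; under this layout it would 
presumably mean recompressing only the last, partially filled chunk.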


