[ 
https://issues.apache.org/jira/browse/HDFS-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147833#comment-13147833
 ] 

Suresh Srinivas commented on HDFS-2542:
---------------------------------------

HDFS-2115 had a much smaller scope than the problem being solved here.

While the description of the jira starts off the discussion, there are a lot of 
details to be covered. Some of the questions I am left with are:
# Post compression, the block files have completely different lengths. The 
length tracked at the NN for the blocks is no longer valid.
# What is the state of the file during compression?
# How do you deal with data that was deemed cold but becomes hot at a later 
point?
# How do the Datanode block scanner and directory scanner, the internal 
datanode data structures that track block length, and Append interact with 
this feature?

Given that, depending on the approach taken, this could result in changes to 
some core parts of HDFS, please write a design document. Alternatively, should 
we look at an external tool that can do this analysis and compress the files, 
based on the HDFS-2115 mechanism proposed by Todd, to minimize the impact on 
HDFS core code?

                
> Transparent compression storage in HDFS
> ---------------------------------------
>
>                 Key: HDFS-2542
>                 URL: https://issues.apache.org/jira/browse/HDFS-2542
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: jinglong.liujl
>
> As in HDFS-2115, we want to provide a mechanism to improve storage usage in 
> HDFS through compression. Unlike HDFS-2115, this issue focuses on compressed 
> storage. Some ideas are below:
> To do:
> 1. Compress cold data.
>    Cold data: after being written (or last read), the data has not been 
> touched by anyone for a long time.
>    Hot data: after being written, many clients will read it, and it may be 
> deleted soon.
>    
>    Because compressing hot data is not cost-effective, we only compress cold 
> data. 
>    In some cases, some data in a file is accessed at high frequency while 
> other data in the same file is cold. 
> To distinguish them, we compress at the block level.
> 2. Compress data that has a high compression ratio.
>    To determine whether the ratio is high or low, we first try to compress 
> the data; if the compression ratio is too low, we never compress it.
> 3. Forward compatibility.
>     After compression, the data format on the datanode has changed, so old 
> clients will not be able to read it. To solve this issue, we provide a 
> mechanism that decompresses on the datanode.
> 4. Support random access and append.
>    As in HDFS-2115, random access can be supported by an index. Before 
> compressing, we split the data into fixed-length pieces (we call each 
> fixed-length piece a "chunk"); every chunk has its own index entry.
> For random access, we can seek to the nearest index entry and read that chunk 
> to reach the precise position.   
> 5. Asynchronous compression, to avoid slowing down running jobs.
>    In practice, we found that cluster CPU usage is not uniform. Some clusters 
> are idle at night, and others are idle in the afternoon. We should run the 
> compression task at full speed when the cluster is idle, and at low speed when 
> the cluster is busy.
> Will do:
> 1. Client-specific codec and support for compressed transmission.
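
The chunk-index idea for random access in the description above could be 
sketched roughly as follows. This is only an illustration under assumptions: 
the use of zlib, the chunk size, and the names compress_block and read_at are 
all hypothetical and do not come from the proposal or from HDFS. Each 
fixed-length chunk of uncompressed data is compressed independently, and the 
index records where each compressed chunk starts.

```python
import zlib

CHUNK_SIZE = 64 * 1024  # hypothetical fixed chunk length (uncompressed bytes)


def compress_block(data, chunk_size=CHUNK_SIZE):
    """Compress each fixed-length chunk of `data` independently.

    Returns (compressed_bytes, index). index[i] is the offset of chunk i
    inside compressed_bytes; one extra final entry marks the end.
    """
    out = bytearray()
    index = []
    for start in range(0, len(data), chunk_size):
        index.append(len(out))
        out += zlib.compress(data[start:start + chunk_size])
    index.append(len(out))
    return bytes(out), index


def read_at(compressed, index, offset, length, chunk_size=CHUNK_SIZE):
    """Read up to `length` uncompressed bytes starting at uncompressed `offset`."""
    result = bytearray()
    chunk_no = offset // chunk_size  # nearest index entry at or before offset
    skip = offset % chunk_size       # position inside that chunk
    while length > 0 and chunk_no < len(index) - 1:
        raw = compressed[index[chunk_no]:index[chunk_no + 1]]
        chunk = zlib.decompress(raw)
        piece = chunk[skip:skip + length]
        result += piece
        length -= len(piece)
        skip = 0                     # later chunks are read from their start
        chunk_no += 1
    return bytes(result)
```

A read only decompresses the chunks that overlap the requested range, so the 
cost of a seek is bounded by the chunk size. The sketch does not address 
append or the block length tracked elsewhere in the system.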

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
