[
https://issues.apache.org/jira/browse/HDFS-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147448#comment-13147448
]
jinglong.liujl commented on HDFS-2542:
--------------------------------------
I agree with you.
Classifying cold/hot data and storing it at lower cost is a generic issue, and
it always depends on application characteristics, so I think we should make
the strategy pluggable and provide a default implementation.
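
As a rough illustration of the pluggable strategy suggested above, a minimal
Java sketch follows. The names CompressionPolicy and IdleTimePolicy, and the
idea of keying the default on idle time, are assumptions made for this
example; none of them is an existing HDFS API.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical plug-in point: decides whether a block is cold enough to compress. */
interface CompressionPolicy {
    boolean shouldCompress(Instant lastAccess, Instant now);
}

/** Assumed default: a block counts as cold once it has been idle for a fixed threshold. */
final class IdleTimePolicy implements CompressionPolicy {
    private final Duration threshold;

    IdleTimePolicy(Duration threshold) {
        this.threshold = threshold;
    }

    @Override
    public boolean shouldCompress(Instant lastAccess, Instant now) {
        // Cold means untouched (neither written nor read) for at least the threshold.
        return Duration.between(lastAccess, now).compareTo(threshold) >= 0;
    }
}
```

An application with different access patterns would supply its own
CompressionPolicy, for example one that also looks at read counts.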
> Transparent compression storage in HDFS
> ---------------------------------------
>
> Key: HDFS-2542
> URL: https://issues.apache.org/jira/browse/HDFS-2542
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: jinglong.liujl
>
> As in HDFS-2115, we want to provide a mechanism to improve storage usage in
> HDFS through compression. Unlike HDFS-2115, this issue focuses on compressed
> storage. Some ideas are listed below:
> To do:
> 1. Compress cold data.
> Cold data: data that nobody has touched for a long time after it was
> written (or last read).
> Hot data: data that many clients read after it is written; it may be
> deleted soon.
>
> Because compressing hot data is not cost-effective, we only compress cold
> data.
> In some cases, part of a file is accessed at high frequency while other
> data in the same file is cold.
> To distinguish them, we compress at the block level.
> 2. Compress data that has a high compression ratio.
> To tell high from low compression ratio, we first try to compress the data;
> if the compression ratio is too low, we never compress it.
> 3. Forward compatibility.
> After compression, the data format on the datanode changes, so old clients
> cannot read it. To solve this issue, we provide a mechanism that decompresses
> on the datanode.
> 4. Support random access and append.
> As in HDFS-2115, random access can be supported by an index. Before
> compression, we split the data into fixed-length pieces (we call each piece
> a "chunk"), and every chunk has its own index entry.
> For random access, we seek to the nearest index entry and read that chunk
> to reach the precise position.
> 5. Compress asynchronously so that compression does not slow down running jobs.
> In practice, we found that cluster CPU usage is not uniform. Some clusters
> are idle at night, and others are idle in the afternoon. We should run
> compression tasks at full speed when the cluster is idle, and at low speed
> when the cluster is busy.
> Will do:
> 1. Client-specific codec and support for compressed transmission.
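
To make the chunk index and the compression-ratio trial from the description
above concrete, here is a small Java sketch. It is not the design proposed in
the issue: the class name ChunkedBlock, the 4096-byte chunk length, the
in-memory layout, and the use of java.util.zip are all assumptions chosen to
keep the example self-contained.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Sketch only: a block stored as independently compressed fixed-length chunks plus an index. */
final class ChunkedBlock {
    static final int CHUNK_SIZE = 4096; // assumed chunk length, not taken from the issue

    private final byte[] compressed; // all compressed chunks, back to back
    private final int[] offsets;     // offsets[i] = start of chunk i; last entry = total length
    private final int rawLength;

    ChunkedBlock(byte[] raw) {
        int chunks = (raw.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        offsets = new int[chunks + 1];
        rawLength = raw.length;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        for (int i = 0; i < chunks; i++) {
            offsets[i] = out.size(); // index entry: where chunk i starts
            int start = i * CHUNK_SIZE;
            Deflater deflater = new Deflater();
            deflater.setInput(raw, start, Math.min(CHUNK_SIZE, raw.length - start));
            deflater.finish();
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            deflater.end();
        }
        offsets[chunks] = out.size();
        compressed = out.toByteArray();
    }

    /** Compressed size divided by raw size; a value near or above 1.0 means compression is not worth it. */
    double ratio() {
        return rawLength == 0 ? 1.0 : (double) compressed.length / rawLength;
    }

    /** Reads one byte at a raw position by inflating only the chunk that contains it. */
    byte readAt(int pos) {
        if (pos < 0 || pos >= rawLength) {
            throw new IndexOutOfBoundsException("position " + pos);
        }
        int chunk = pos / CHUNK_SIZE; // nearest index entry at or before pos
        Inflater inflater = new Inflater();
        inflater.setInput(compressed, offsets[chunk], offsets[chunk + 1] - offsets[chunk]);
        byte[] plain = new byte[CHUNK_SIZE];
        int n = 0;
        try {
            while (!inflater.finished() && n < plain.length) {
                int read = inflater.inflate(plain, n, plain.length - n);
                if (read == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unexpected input; stop instead of spinning
                }
                n += read;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt chunk " + chunk, e);
        } finally {
            inflater.end();
        }
        return plain[pos % CHUNK_SIZE];
    }
}
```

A caller could build a ChunkedBlock from a sample of the data, keep the block
uncompressed when ratio() is close to 1.0, and otherwise serve reads through
readAt so that only one chunk is inflated per seek.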