[
https://issues.apache.org/jira/browse/HDFS-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006648#comment-16006648
]
Hari Sekhon commented on HDFS-2542:
-----------------------------------
I recall looking for this feature 2-3 years ago while at a large financial
institution, and it has just come up again with another large financial client
I'm working for right now.
I see I already upvoted this JIRA the last time I looked at it, but
there has been no movement on it in years.
Having transparent compression on a directory tree would be a very useful
feature.
Is there any chance of this being implemented?
> Transparent compression storage in HDFS
> ---------------------------------------
>
> Key: HDFS-2542
> URL: https://issues.apache.org/jira/browse/HDFS-2542
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: jinglong.liujl
> Attachments: tranparent compress storage.docx
>
>
> As with HDFS-2115, we want to provide a mechanism that improves storage usage
> in HDFS through compression. Unlike HDFS-2115, this issue focuses on
> compressed storage. Some ideas are listed below:
> To do:
> 1. Compress cold data.
> Cold data: data that has not been touched by anyone for a long time after
> it was written (or last read).
> Hot data: data that many clients read after it is written, and that may be
> deleted soon.
>
> Because compressing hot data is not cost-effective, we compress only cold
> data.
> In some cases, part of a file is accessed frequently while other parts of
> the same file are cold.
> To distinguish them, we compress at the block level.
> 2. Compress only data that has a high compression ratio.
> To tell whether the ratio is high or low, we trial-compress the data; if
> the compression ratio is too low, we never compress it.
> 3. Forward compatibility.
> After compression, the data format on the datanode changes, so old clients
> cannot read it. To solve this, we provide a mechanism that decompresses
> on the datanode.
> 4. Support random access and append.
> As in HDFS-2115, random access can be supported by an index. Before
> compressing, we split the data into fixed-length pieces (we call each piece a
> "chunk"), and every chunk has its own index entry.
> For random access, we seek to the nearest index entry and read that chunk to
> reach the precise position.
> 5. Compress asynchronously so that compression does not slow down running jobs.
> In practice, we found that cluster CPU usage is not uniform. Some clusters
> are idle at night, and others are idle in the afternoon. The compression
> task should run at full speed when the cluster is idle and at low speed when
> the cluster is busy.
> Will do:
> 1. Client-specific codecs and support for compressed transmission.
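
For illustration only, here is a minimal Python sketch of two ideas from the
description above: trial compression to decide whether a block is worth
compressing, and a per-chunk index that allows random access into compressed
data. It is not HDFS code and not the design in the attached document. The
function names, the 64 KiB chunk size, the 10% saving threshold and the use of
zlib are all assumptions made for the example.

```python
import zlib

CHUNK_SIZE = 64 * 1024  # assumed fixed chunk length, in uncompressed bytes
MIN_SAVING = 0.10       # assumed threshold: skip data that shrinks by less than 10%


def compress_block(data, chunk_size=CHUNK_SIZE):
    """Compress data chunk by chunk and return (payload, index).

    index[i] is the offset in payload where compressed chunk i starts;
    one extra final entry marks the end of the payload.
    """
    parts, index, offset = [], [], 0
    for start in range(0, len(data), chunk_size):
        comp = zlib.compress(data[start:start + chunk_size])
        index.append(offset)
        parts.append(comp)
        offset += len(comp)
    index.append(offset)
    return b"".join(parts), index


def worth_compressing(data, payload, min_saving=MIN_SAVING):
    """Trial-compression check: keep the compressed form only if it saves enough."""
    if not data:
        return False
    return 1.0 - len(payload) / len(data) >= min_saving


def read_range(payload, index, pos, length, chunk_size=CHUNK_SIZE):
    """Return `length` uncompressed bytes starting at uncompressed offset `pos`."""
    out = bytearray()
    chunk = pos // chunk_size   # nearest index entry at or before pos
    skip = pos % chunk_size     # bytes to skip inside that chunk
    while length > 0 and chunk < len(index) - 1:
        raw = zlib.decompress(payload[index[chunk]:index[chunk + 1]])
        piece = raw[skip:skip + length]
        if not piece:           # pos lies past the end of the data
            break
        out += piece
        length -= len(piece)
        chunk += 1
        skip = 0
    return bytes(out)
```

Only the chunks that overlap the requested range are decompressed, which is
the point of keeping an index entry per fixed-length chunk. Append is not
covered by this sketch.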