[ 
https://issues.apache.org/jira/browse/HDFS-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147141#comment-13147141
 ] 

jinglong.liujl commented on HDFS-2542:
--------------------------------------

To Tim:
   Absolutely, compression efficiency depends on the codec and on the data 
being compressed. As a first step, the prototype can use one specified codec. 
In the future, we could pick the right codec for different data in a 
self-adapting way, but I have no idea yet how to implement that efficiently.
   In our prototype, we decide "when to compress" in two ways. 
   1. Data xceiver count and the number of compressing tasks. 
      When a datanode has a high data xceiver count, it usually has to serve 
many client requests (including balancing and block replication). At such 
times compression is not an urgent task, so it can be slowed down to release 
resources for the computing tasks.
   2. We run compression as a separate process in the idle CPU class. This 
way, when a CPU-intensive job arrives, the compression task yields its CPU 
slices to the job, and when the cluster is idle, compression runs at full 
speed.
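The two mechanisms above can be sketched roughly as follows. This is a 
minimal illustration, not the prototype's actual code: `CompressionThrottle`, 
`MAX_XCEIVERS_FOR_COMPRESSION`, and the `IntSupplier` hook for the xceiver 
count are all hypothetical names, and Java's `Thread.MIN_PRIORITY` is only a 
rough stand-in for an OS-level idle CPU class.

```java
// Sketch: gate background compression on datanode load, and run the
// worker at the lowest thread priority so client I/O wins CPU contention.
public class CompressionThrottle {
    // Hypothetical threshold: skip compression when the node is busy.
    static final int MAX_XCEIVERS_FOR_COMPRESSION = 8;

    private final java.util.function.IntSupplier xceiverCount;

    public CompressionThrottle(java.util.function.IntSupplier xceiverCount) {
        this.xceiverCount = xceiverCount;
    }

    /** True when the node is idle enough to compress another block. */
    public boolean shouldCompressNow() {
        return xceiverCount.getAsInt() < MAX_XCEIVERS_FOR_COMPRESSION;
    }

    /** Starts the compression worker at minimum thread priority. */
    public Thread startWorker(Runnable compressOneBlock) {
        Thread t = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                if (shouldCompressNow()) {
                    compressOneBlock.run();       // compress one block, then re-check load
                } else {
                    try { Thread.sleep(1000); }   // busy: back off and re-check
                    catch (InterruptedException e) { return; }
                }
            }
        });
        t.setPriority(Thread.MIN_PRIORITY);  // rough analogue of an idle CPU class
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

A real implementation would run compression out-of-process (e.g. under 
`nice`/`ionice` on Linux) rather than relying on JVM thread priorities.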

 >> In any event, I don't think it is a given that compression of hot data will 
 >> always be inefficient in all codecs for all hardware for all users at all 
 >> times.

   That's right: compressing before upload saves bandwidth and reduces 
transmission cost, but it also slows down running jobs. It's a trade-off. In 
our cluster, CPU utilization is not even over time, so using idle periods for 
compression is valuable. To reduce transmission cost, we will support 
compression on the client as well.

To Robert:
   Absolutely, detecting hot/cold data is really important. To distinguish 
them, we add an atime at the block level. The atime is updated only when a 
DFSClient reads the block; block replication, the block scanner, and 
re-balancing do not modify it. The value is stored on disk so it is not lost 
when the datanode restarts.
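The block-level atime rule described above (update only on client reads, 
persist across restarts) could be sketched like this. All names here 
(`BlockAtimeTracker`, `recordClientRead`, `isCold`) are illustrative 
inventions, not the prototype's actual classes, and the on-disk format is a 
placeholder:

```java
// Sketch: per-block access time, updated only on DFSClient reads and
// persisted so it survives datanode restarts.
import java.io.*;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class BlockAtimeTracker {
    private final Map<Long, Long> atimeByBlockId = new ConcurrentHashMap<>();

    /** Called only on client reads; replication, the block scanner,
     *  and re-balancing deliberately do not call this. */
    public void recordClientRead(long blockId) {
        atimeByBlockId.put(blockId, System.currentTimeMillis());
    }

    /** A block is "cold" if untouched for longer than the threshold. */
    public boolean isCold(long blockId, long coldAfterMillis) {
        Long atime = atimeByBlockId.get(blockId);
        return atime == null
            || System.currentTimeMillis() - atime > coldAfterMillis;
    }

    /** Persist atimes so they are not lost when the datanode restarts. */
    public void save(File f) throws IOException {
        try (DataOutputStream out =
                 new DataOutputStream(new FileOutputStream(f))) {
            out.writeInt(atimeByBlockId.size());
            for (Map.Entry<Long, Long> e : atimeByBlockId.entrySet()) {
                out.writeLong(e.getKey());
                out.writeLong(e.getValue());
            }
        }
    }

    /** Reload persisted atimes at datanode startup. */
    public void load(File f) throws IOException {
        try (DataInputStream in =
                 new DataInputStream(new FileInputStream(f))) {
            int n = in.readInt();
            for (int i = 0; i < n; i++) {
                atimeByBlockId.put(in.readLong(), in.readLong());
            }
        }
    }
}
```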
   Back to the cold/hot data topic: we can make many improvements for 
different applications. For example, if we use HDFS as image storage, where a 
hot image can be accessed thousands of times per second, we can use SSDs to 
reduce latency and SATA disks for cold data to keep costs down.
   Currently, low latency is not very important in our Hadoop cluster, so for 
cost-effectiveness we have not made any improvements for hot data. But for 
cold data, I think compression + RaidNode + cheaper disks is a feasible way to 
limit storage cost.
                
> Transparent compression storage in HDFS
> ---------------------------------------
>
>                 Key: HDFS-2542
>                 URL: https://issues.apache.org/jira/browse/HDFS-2542
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: jinglong.liujl
>
> As in HDFS-2115, we want to provide a mechanism to improve storage usage in 
> HDFS by compression. Different from HDFS-2115, this issue focuses on 
> compressed storage. The main ideas are below:
> To do:
> 1. Compress cold data.
>    Cold data: data that has not been touched by anyone for a long time after 
> being written (or last read).
>    Hot data: data that many clients will read soon after writing; it may 
> also be deleted soon.
>    Because compressing hot data is not cost-effective, we only compress cold 
> data.
>    In some cases, some data in a file is accessed frequently while other 
> data in the same file is cold. To distinguish them, we compress at the block 
> level.
> 2. Compress data that has a high compression ratio.
>    To tell a high from a low compression ratio, we try compressing the data; 
> if the ratio is too low, we never compress it.
> 3. Forward compatibility.
>    After compression, the data format on the datanode has changed, and old 
> clients cannot read it. To solve this, we provide a mechanism that 
> decompresses on the datanode.
> 4. Support random access and append.
>    As in HDFS-2115, random access can be supported by an index. We split the 
> data into fixed-length pieces before compression (we call each fixed-length 
> piece a "chunk"), and every chunk has its own index entry.
>    On random access, we seek to the nearest index entry and read within that 
> chunk to reach the precise position.
> 5. Asynchronous compression, so that compression does not slow down running 
> jobs.
>    In practice, we found that cluster CPU usage is not uniform: some 
> clusters are idle at night, others in the afternoon. Compression tasks 
> should run at full speed when the cluster is idle and at low speed when it 
> is busy.
> Will do:
> 1. Client-specified codec and support for compressed transmission.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
