Hello,
On 07/18/11 21:43, Uma Maheswara Rao G 72686 wrote:
Hi,
We have already thought about it.
No, I think we are talking about different problems. What I'm talking
about is how to reduce the number of replicas while still achieving the
same data reliability. The data in each replica can already be compressed.
To illustrate the problem, here is a more concrete example:
The size of block A is X. After it is compressed, its size is Y. When it
is written to HDFS, it needs to be replicated if we want the data to be
reliable. If the replication factor is R, then R*Y bytes will be written
to disk, and (R-1)*Y bytes will be transmitted over the network.
Now, if we use a better encoding to achieve data reliability, for B data
blocks we can have P parity blocks. Then for each block, only (1 + P/B)*Y
bytes need to be written to disk and (P/B)*Y bytes transmitted over the
network, so it's possible to further reduce the network and disk
bandwidth.
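To make the numbers concrete, here is a small sketch (plain Java, with
made-up example values R = 3, B = 10, P = 4, Y = 64 MB; none of these
come from HDFS itself) that just evaluates the two cost formulas above:

    public class ReplicaCostSketch {
        public static void main(String[] args) {
            double y = 64.0; // compressed block size Y, in MB (example value)
            int r = 3;       // replication factor R
            int b = 10;      // data blocks per group, B
            int p = 4;       // parity blocks per group, P

            // Plain replication: R copies on disk, R-1 of them shipped over the network.
            double replDisk = r * y;
            double replNet  = (r - 1) * y;

            // Parity-based encoding: 1 + P/B block-equivalents on disk per block,
            // and P/B block-equivalents over the network.
            double ecDisk = (1.0 + (double) p / b) * y;
            double ecNet  = ((double) p / b) * y;

            System.out.printf("replication:   disk %.1f MB, network %.1f MB%n", replDisk, replNet);
            System.out.printf("parity blocks: disk %.1f MB, network %.1f MB%n", ecDisk, ecNet);
        }
    }

With these example numbers, replication costs 192 MB of disk and 128 MB of
network per block, while the encoded layout costs about 89.6 MB and 25.6 MB.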
So what Joey showed me is more relevant, even though it doesn't reduce
the data size before the data is written to the network or the disk.
To implement that, I think we will probably not use the write pipeline
any more.
Looks like you are talking about these features, right?
https://issues.apache.org/jira/browse/HDFS-1640
https://issues.apache.org/jira/browse/HDFS-2115
About your patches, I don't know how useful they would be when we can ask
the applications to compress the data. For example, we can enable
mapred.output.compress in MapReduce to ask the reducers to compress their
output. I assume MapReduce is the major user of HDFS.
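For completeness, here is a minimal sketch of what that looks like from the
job side (old mapred API, gzip chosen arbitrarily; the class name is only
for illustration):

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class CompressedOutputSketch {
        public static void main(String[] args) {
            JobConf conf = new JobConf(CompressedOutputSketch.class);
            // Ask reducers to compress their output before it is written to HDFS;
            // this is the programmatic equivalent of mapred.output.compress=true.
            conf.setBoolean("mapred.output.compress", true);
            FileOutputFormat.setCompressOutput(conf, true); // same effect as the flag above
            FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
            // ... the rest of the job setup (paths, mapper, reducer) goes here
        }
    }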
Thanks,
Da