[ https://issues.apache.org/jira/browse/HDFS-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15728224#comment-15728224 ]
SammiChen commented on HDFS-7717:
---------------------------------

As the task description says, there are three possible solutions.

1. Existing tool, distcp. If the user wants to convert a replicated directory, he/she can create a new directory, set the desired erasure coding policy on the new directory, then use distcp to copy the files into it. This is a feasible solution, but not an efficient one: if some files under the directory are already EC files, the copy is still carried out for them.

2. A tool like Mover. An HDFS client tool which takes a directory path and an erasure coding policy as parameters and converts every file that doesn't comply with the specified erasure coding policy into an EC file. As an HDFS client, this tool can leverage all the existing DFS client striped file read/write logic, so the logic in the tool itself will be very simple and generic. It can support conversion between replication and an erasure coding policy, and also conversion between different erasure coding policies. The open question here is whether or not to leverage MapReduce. If the tool is a MapReduce tool, the erasure coding calculation overhead will be spread evenly across all the DataNodes; if the tool is an HDFS-only tool, the erasure coding calculation overhead falls entirely on the client node.

3. NameNode. A daemon thread like the one in SPS, which accepts instructions through an API and makes the conversion decisions by walking the directory namespace. If a file is found that does not comply with its erasure coding policy, or with the erasure coding policy inherited from its parent directories, the daemon marks the file and sends a task to a DataNode instructing it to do the actual conversion work. Compared with solution 2, this solution has pros and cons: it adds overhead to the NameNode to make the conversion decisions, and it costs DataNode CPU resources to do the EC encoding and decoding. The benefit of this solution is that it is easier to use and more automatic.

Above are my thoughts.
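The decision step shared by solutions 2 and 3 (skip files that already match the target policy, convert the rest) can be sketched as a minimal, self-contained Python simulation. `FileStatus` and `files_needing_conversion` are illustrative stand-ins, not the real Hadoop HDFS classes, and the policy names are just examples:

```python
# Sketch of the conversion-decision logic: compare each file's current
# erasure coding policy with the target policy and collect the files that
# still need conversion. FileStatus here is a hypothetical stand-in for an
# HDFS file listing entry, not the Hadoop API.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileStatus:
    path: str
    ec_policy: Optional[str]  # None means plain replication


def files_needing_conversion(files: List[FileStatus],
                             target_policy: Optional[str]) -> List[str]:
    """Return the paths whose current policy differs from the target.

    Files already matching the target are skipped -- exactly the
    inefficiency that plain distcp (solution 1) cannot avoid, since it
    copies every file regardless of its current policy.
    """
    return [f.path for f in files if f.ec_policy != target_policy]


if __name__ == "__main__":
    listing = [
        FileStatus("/data/a.log", None),            # 3x replicated file
        FileStatus("/data/b.log", "RS-6-3-1024k"),  # already at target policy
        FileStatus("/data/c.log", "RS-3-2-1024k"),  # different EC policy
    ]
    # Replication -> EC and EC -> EC conversions are handled uniformly.
    todo = files_needing_conversion(listing, "RS-6-3-1024k")
    print(todo)  # ['/data/a.log', '/data/c.log']
```

In solution 2 this check runs in the client tool before reading and re-writing each file; in solution 3 the same decision is made by the NameNode daemon while walking the namespace, with the re-encoding work then delegated to DataNodes.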
I don't know whether I missed any important factors, beyond EC calculation overhead, logic simplicity, and ease of use, in making the decision, so I would like your advice. [~jzhao], [~andrew.wang], [~drankye], [~umamaheswararao], [~zhz] Personally, I would prefer to start with solution 2, which is logically simpler.

> Erasure Coding: distribute replication to EC conversion work to DataNode
> ------------------------------------------------------------------------
>
>          Key: HDFS-7717
>          URL: https://issues.apache.org/jira/browse/HDFS-7717
>      Project: Hadoop HDFS
>   Issue Type: Sub-task
>     Reporter: Jing Zhao
>     Assignee: SammiChen
>
> In the *striped* erasure coding case, we need some approach to distribute the conversion work between replication and striped erasure coding to the DataNodes. It can be the NameNode, a tool utilizing MR just like the current distcp, or another one like the balancer/mover.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)