I have a question. Now that Hadoop has a "mapper" and a "reducer", how about a solution like Map<Tree (node-pair set), root (node)> and Reduce<root (node), Tree (node-pair set)>, or directly Reduce<root (node), node pair>? Here a node pair can be seen as a branch...
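A minimal sketch of one way to read this idea, assuming branches arrive as tab-separated (parent, child) path pairs and taking the top-level path component as the subtree "root"; the class names and the input format here are hypothetical, not anything from Hadoop itself:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: one input line per branch, "parentPath<TAB>childPath".
// Emits (root, branch) so all branches of one subtree meet in one reduce call.
class BranchMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] pair = value.toString().split("\t");
        if (pair.length != 2) return;                 // skip malformed lines
        String root = "/" + pair[1].split("/")[1];    // top-level component of the child path
        ctx.write(new Text(root), value);
    }
}

// Reducer: collects all (parent, child) branches sharing a root,
// i.e. re-assembles the subtree as a serialized set of node pairs.
class TreeReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text root, Iterable<Text> branches, Context ctx)
            throws IOException, InterruptedException {
        StringBuilder tree = new StringBuilder();
        for (Text branch : branches) {
            tree.append(branch.toString()).append(';');
        }
        ctx.write(root, new Text(tree.toString()));
    }
}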
----- Original Message -----
From: "Konstantin Shvachko" <s...@yahoo-inc.com>
To: <common-dev@hadoop.apache.org>
Sent: Tuesday, March 02, 2010 10:21 AM
Subject: Re: Namespace partitioning using Locality Sensitive Hashing

> Symlinks are a brand-new feature in HDFS.
> You can read about it in
> https://issues.apache.org/jira/browse/HDFS-245
> Documentation is here:
> https://issues.apache.org/jira/secure/attachment/12434745/design-doc-v4.txt
>
> Symbolic links in HDFS can point to a directory in a different file
> system, in particular on a different HDFS cluster. They act like mount
> points in that case. So you can create a symlink on cluster C1 pointing
> to the root of cluster C2. This makes C2 a sub-namespace of C1.
>
> --Konstantin
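As a minimal sketch of the mount-point setup Konstantin describes, assuming two clusters whose name-nodes listen at the hypothetical addresses c1-nn:8020 and c2-nn:8020 (FileContext.createSymlink is the client API introduced by HDFS-245):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

class MountC2 {
    public static void main(String[] args) throws Exception {
        // The client is configured to talk to cluster C1's name-node.
        FileContext fc = FileContext.getFileContext(new Configuration());

        // Link /mnt/c2 in C1's namespace to the root of cluster C2.
        fc.createSymlink(new Path("hdfs://c2-nn:8020/"),  // target: root of C2
                         new Path("/mnt/c2"),             // link in C1's namespace
                         true);                           // create parent dirs if needed
    }
}

Once the link exists, a path such as /mnt/c2/some/file opened through C1 resolves on C2's name-node, which is what makes C2 a sub-namespace of C1.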
> On 3/1/2010 5:42 PM, Ketan Dixit wrote:
>> Hello,
>> Thank you, Konstantin and Allen, for your reply. The information
>> you provided really helped to improve my understanding.
>> However, I still have a few questions.
>> How are symlinks/soft links used to solve the problem of partitioning?
>> (Where do the symlinks point to? All the mapping is stored in memory,
>> but symlinks point to file objects? This is a little confusing to me.)
>> Can you please provide some insight into this?
>>
>> Thanks,
>> Ketan
>>
>> On Mon, Mar 1, 2010 at 3:26 PM, Konstantin Shvachko <s...@yahoo-inc.com> wrote:
>>>
>>> Hi Ketan,
>>>
>>> AFAIU, hashing is used to map files and directories onto different
>>> name-nodes. Suppose you use a simple hash function on a file path,
>>> h(path), and that files with the same hash value (or within a hash
>>> range) are mapped to the same name-node. Then files with the same
>>> parent will be randomly mapped to different name-nodes:
>>> Pr(h(/dir/file1) = h(/dir/file2)) is small.
>>>
>>> The idea with LSH is to add a locality factor to the hash function
>>> in order to increase the probability of placing files from the same
>>> directory (or subtree) on the same name-node.
>>>
>>> Example 1.
>>> Suppose that you apply MD5 only to the path of the parent rather
>>> than to the entire file path: h(/root/dir/file) = MD5(/root/dir).
>>> Then all files in the same directory will have the same hash value
>>> and will therefore be mapped to the same name-node.
>>>
>>> Example 2.
>>> Suppose a path consists of path components pi, where i = 1,...,n.
>>> Let's define the following hash function:
>>> h(/p1/.../pn) = 0, if n < 7
>>> h(/p1/.../pn) = MD5(/p1/.../p7), if n >= 7
>>> With this hash function, each subtree rooted at level 7 of the
>>> namespace hierarchy will be mapped entirely to the same name-node.
>>>
>>> There could be more elaborate examples.
>>>
>>> Symlinks do provide a way to partition the namespace, as Allen points
>>> out, although this is a static partitioning. Static partitions, as
>>> opposed to dynamic ones, do not guarantee that the partitions will be
>>> "equal" in size, where "size" may have different meanings (such as the
>>> number of files, the space occupied by the files, or the number of
>>> blocks). A good hash function needs to conform to some
>>> equal-partitioning requirement. The function from Example 2 would be
>>> considered bad in this sense, while Example 1 defines a good one.
>>>
>>> This is my take on the problem. Hope it makes sense,
>>> --Konstantin
>>>
>>> On 3/1/2010 8:48 AM, Ketan Dixit wrote:
>>>>
>>>> Hi,
>>>> I am a graduate student in the Computer Science department at SUNY
>>>> Stony Brook. I am thinking of doing a project on Hadoop for my course
>>>> "Cloud Computing", conducted by Prof. Radu Sion.
>>>> While going through the links on the "Yahoo open source projects for
>>>> students" page, I found the idea "Research on new hashing schemes for
>>>> filesystem namespace partitioning" interesting. It looks to me like
>>>> the idea is to assign one subtree of the whole namespace to one
>>>> name-node and another subtree to another name-node.
>>>> How is LSH better than normal hashing? With any hashing scheme, a
>>>> client or a fixed name-node still has to decide which name-node to
>>>> contact. It seems to me that if requests for files under the same
>>>> subtree are directed to the same name-node, performance will be
>>>> better, because requests to that name-node are clustered around one
>>>> part of the namespace subtree (for example, the part on which the
>>>> client is currently operating). Is this assumption correct? Can I
>>>> have more insight in this regard?
>>>>
>>>> Thanks,
>>>> Ketan
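For reference, a minimal sketch of the two hash functions from Konstantin's Example 1 and Example 2, assuming absolute '/'-separated paths; folding the first 8 bytes of the MD5 digest into a long is just one arbitrary choice:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

class NamespaceHash {

    // Example 1: hash only the parent path, so all files in one
    // directory map to the same name-node.
    static long hashByParent(String path) throws Exception {
        String parent = path.substring(0, path.lastIndexOf('/'));
        if (parent.isEmpty()) parent = "/";
        return md5ToLong(parent);
    }

    // Example 2: hash the first 7 components, so each subtree rooted at
    // level 7 maps entirely to one name-node. Every shallower path hashes
    // to 0 and lands on a single name-node, which is why Konstantin calls
    // this function bad for equal partitioning.
    static long hashByLevel7(String path) throws Exception {
        String[] c = path.split("/");   // c[0] is "" for absolute paths
        if (c.length - 1 < 7) return 0;
        return md5ToLong("/" + String.join("/", Arrays.copyOfRange(c, 1, 8)));
    }

    // Fold the first 8 bytes of the MD5 digest into a long.
    static long md5ToLong(String s) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5")
                                .digest(s.getBytes(StandardCharsets.UTF_8));
        long h = 0;
        for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL);
        return h;
    }

    public static void main(String[] args) throws Exception {
        // Files with the same parent get the same Example-1 hash:
        System.out.println(hashByParent("/dir/file1") == hashByParent("/dir/file2")); // true
    }
}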