[ https://issues.apache.org/jira/browse/CRUNCH-644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997951#comment-15997951 ]

Gabriel Reid commented on CRUNCH-644:
-------------------------------------

I take it nobody is wildly against this, so I'll commit it shortly unless I hear otherwise.

> Set HDFS node affinity on created HFiles to improve locality
> ------------------------------------------------------------
>
>                 Key: CRUNCH-644
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-644
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Gabriel Reid
>         Attachments: CRUNCH-644.patch
>
>
> When creating HFiles via the {{HFileUtils.writeToHFilesForIncrementalLoad}}
> method, the underlying HDFS blocks of the created HFiles will end up on a
> selection of HDFS data nodes -- the selection of which nodes is left up to
> the HDFS Namenode. This means that there is a relatively small chance
> (depending on cluster size and replication factor) that the created HFiles
> will end up on the same physical machine as the region server which will make
> use of these HFiles, which limits the ability to use short-circuit reads to
> the local file system. Typically, this lack of locality is only really
> completely resolved after a major compaction.
> It's possible to set a node affinity on HDFS files at creation time, to
> provide a suggestion to the namenode about a preferred data node for blocks
> to be located on. The intention of this ticket is to make use of this
> functionality to set the node affinity during HFile creation in
> {{HFileUtils.writeToHFilesForIncrementalLoad}} so that at least one (HDFS)
> block of each created HFile will be located on the same physical machine as
> the region server which will be using the file (assuming HDFS data nodes are
> running on the same machines as HBase region servers).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)