[ https://issues.apache.org/jira/browse/CRUNCH-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15145599#comment-15145599 ]
Micah Whitacre commented on CRUNCH-588: --------------------------------------- Worked with Stephen and got this to merge with the "git am --signoff patch --reject --whitespace=fix" command. Will be merging to master shortly. > Modify HFileUtils to flex on affected regions for hfiles, rather than all > regions > --------------------------------------------------------------------------------- > > Key: CRUNCH-588 > URL: https://issues.apache.org/jira/browse/CRUNCH-588 > Project: Crunch > Issue Type: Improvement > Components: Core > Reporter: Stephen Durfey > Assignee: Stephen Durfey > Attachments: crunch-588-0.14-SNAPSHOT.patch, > crunch-588-0.14-SNAPSHOT_v2.patch, hfileutils_0.8.5.patch > > > HFileUtils when preparing for writing HFiles sets the [number of reducers | > https://github.com/apache/crunch/blob/master/crunch-hbase/src/main/java/org/apache/crunch/io/hbase/HFileUtils.java#L422] > equal to the number of regions in the table, and then writes out the start > keys for each region to a sequence file for the TotalOrderPartitioner to > consume when partitioning data. This can result in a very large quantity of > reducers that don't do anything due to not having any data to write to hfiles > for the region its partition belonged to. > My proposal is to modify HFileUtils, with an optional parameter (or a config, > that's up for debate) to determine which regions data will be loaded into > ahead of time, and set the number of reducers to equal the number of regions, > and only write out the start keys for those affected regions. > I have working code to do this on the 0.8.x branch of crunch, as that is what > I am currently on. I can modify it to work on more recent versions, but I > wanted to start a discussion around the viability of this code being > contributed back to the community. I am still in process of capturing metrics > around the impact of the change (and trying to get data large enough to test > this out), but at least from a reducer count I have seen substantial drops in > my limited testing so far. For example, I had a job go from 705 reduce tasks > during the write down to 36 reduce tasks. > I've attached what I have so far as of 0.8.4. I'm going to start working on a > version modified for the latest version of crunch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)