David: The determination of splits is done in TableInputFormatBase.getSplits(), where table.getStartEndKeys() is called to get the region boundaries. You can take a look there and see how to customize the splits.
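To make the idea concrete, here is a minimal, dependency-free sketch of the boundary-adjustment logic such a custom getSplits() could apply. It assumes a composite rowkey with a fixed-width user_id prefix (8 bytes here, a hypothetical choice); a real override would subclass TableInputFormat, call super.getSplits(), and rebuild the splits from boundaries snapped this way. The class and method names below are illustrative, not HBase API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: snap each region boundary back to the start of a user's key
// range so that no single user's rows straddle two splits. Assumes
// rowkeys of the form <user_id><timestamp> with a fixed-width user_id.
public class SplitSnapper {

    static final int USER_PREFIX_LEN = 8; // assumed fixed-width user_id

    // Truncate a region boundary key to the user_id prefix, so the
    // boundary falls exactly at the first possible key of that user.
    static byte[] snapToUserBoundary(byte[] regionKey) {
        if (regionKey.length <= USER_PREFIX_LEN) {
            return regionKey; // already at (or before) a user boundary
        }
        return Arrays.copyOf(regionKey, USER_PREFIX_LEN);
    }

    // Rebuild a list of split boundaries from raw region start keys,
    // dropping duplicates created when two region boundaries fall
    // inside the same user's range.
    static List<byte[]> snapAll(List<byte[]> regionKeys) {
        List<byte[]> snapped = new ArrayList<>();
        for (byte[] k : regionKeys) {
            byte[] s = snapToUserBoundary(k);
            if (snapped.isEmpty()
                    || !Arrays.equals(snapped.get(snapped.size() - 1), s)) {
                snapped.add(s);
            }
        }
        return snapped;
    }

    public static void main(String[] args) {
        byte[] boundary = "user0001_17357".getBytes();
        // prints "user0001": the boundary is pulled back to the user prefix
        System.out.println(new String(snapToUserBoundary(boundary)));
    }
}
```

With boundaries snapped like this, each mapper's scan range begins and ends on a user boundary, so a user's whole history stays within one split.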
bq. how do I make sure that a user's history is not split across several Mapper instances?

Even if events for one user are processed by a single mapper, I think you would continue to see timeouts in your map/reduce job.

Cheers

On Sun, Jan 6, 2013 at 4:37 AM, David Koch <[email protected]> wrote:
> Hello,
>
> Is there a way to override the method which is used by TableMapReduceUtil
> to split a HBase table across several mapper instances when running a Map
> Reduce over it?
>
> Our current data model is:
>
> Row-key: <user_id>
> Qualifier: <timestamp>
> Value: <event>
>
> The distribution of "events per <user_id>" is long-tailed. Some users have
> in the order of 10^6 events, whereas the mean is somewhere around 100.
>
> These occasional big rows cause problems when running M/R jobs over the
> table (timeouts, etc.). Also, according to the HBase book they are
> detrimental to the region splitting process. We currently use server-side
> filters to not retrieve these large rows, but it's not a nice solution.
>
> To overcome this, I was thinking about using a "tall and narrow" key design
> (term from the HBase book), meaning in our case that rowkeys are formed by a
> composite of <user_id>_<tmst>.
>
> However, and hence my original question, how would I process this table
> "user-wise" in a HBase M/R job? Specifically, how do I make sure that a
> user's history is not split across several Mapper instances?
>
> Thank you,
>
> /David
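For the "tall and narrow" design discussed above, the key property is that HBase sorts rowkeys as unsigned bytes, so a fixed-width user_id followed by a big-endian timestamp keeps one user's events contiguous and in chronological order. A small self-contained sketch (the helper names are hypothetical, not HBase API):

```java
import java.nio.ByteBuffer;

// Sketch of a <user_id>_<timestamp> composite rowkey. A fixed-width
// user_id plus an 8-byte big-endian timestamp sorts all of one user's
// events together, oldest first, under lexicographic byte ordering.
public class CompositeKey {

    // Build rowkey = user_id bytes followed by 8-byte big-endian timestamp.
    static byte[] rowKey(String userId, long timestamp) {
        ByteBuffer buf = ByteBuffer.allocate(userId.length() + 8);
        buf.put(userId.getBytes());
        buf.putLong(timestamp); // ByteBuffer writes big-endian by default
        return buf.array();
    }

    // Lexicographic unsigned byte comparison, matching HBase's rowkey order.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] early = rowKey("user0001", 1000L);
        byte[] late  = rowKey("user0001", 2000L);
        byte[] other = rowKey("user0002", 500L);
        // Same user sorts by time; different users never interleave.
        System.out.println(compare(early, late) < 0);  // true
        System.out.println(compare(late, other) < 0);  // true
    }
}
```

A mapper can then detect user boundaries by comparing the user_id prefix of consecutive rowkeys, provided the splits themselves fall on user boundaries.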
