Hello, Is there a way to override the method which is used by TableMapReduceUtil to split a HBase table across several mapper instances when running a Map Reduce over it?
Our current data model is: Row-key: <user_id> Qualifier: <timestamp> Value: <event> The distribution of "events per <user_id>" is long-tailed. Some users have in the order of 10^6 events, whereas the mean is somewhat around 100. These occasional big rows cause problems when running M/R jobs over the table (timeouts, etc.). Also, according to the HBase book they are detrimental to the region splitting process. We currently use server-side filters to not retrieve these large rows but it's not a nice solution. To overcome this, I was thinking about using a "tall and narrow" key design (term from HBase book), meaning in our case that rowkeys are formed by a composite of <user_id>_<tmst>. However, and hence my original question, how would I process this table "user-wise" in a HBase M/R job? Specifically, how do I make sure that a user's history is not split across several Mapper instances? Thank you, /David
