Re: Question regarding region scans in HBase integration

Daniel Einspanjer Sat, 11 Sep 2010 19:27:23 -0700

 Okay, that getSplits part is specifically where my code was involved.

My use case was one of salted rowkeys. We are storing documents that have a guid as the id and the creation date of the document is important for scanning. When we tested having a rowkey format of <creationtimestamp>+<guid>, the RegionServer hotspots became problematic, so we decided to salt the rowkey by using the first digit of the guid: <hexchar>+<creationtimestamp>+<guid>. This gives us nice distribution of inserts throughout the cluster, but of course, it makes scanning a contiguous date range much more complicated.
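As a rough illustration of the key layout described above, here is a minimal Python sketch. The function name `salted_rowkey` and the fixed-width, sortable timestamp string are assumptions for illustration only; the real keys would be built as byte arrays in the Java client.

```python
HEX_DIGITS = "0123456789abcdef"


def salted_rowkey(creation_ts: str, guid: str) -> str:
    """Build <hexchar>+<creationtimestamp>+<guid>.

    The salt is the first hex digit of the guid, so writes that share a
    timestamp are spread over up to 16 key prefixes instead of one.
    Assumption: creation_ts is a fixed-width string that sorts
    chronologically (e.g. yyyyMMdd).
    """
    if not guid:
        raise ValueError("guid must not be empty")
    salt = guid[0].lower()
    if salt not in HEX_DIGITS:
        raise ValueError("guid must start with a hex digit")
    return salt + creation_ts + guid


if __name__ == "__main__":
    # Two documents created on the same day get different key prefixes.
    for guid in ("a3f09c1e", "07bd44e2"):
        print(salted_rowkey("20100911", guid))
```

The cost is visible in the layout: one date range now corresponds to one key range per salt prefix rather than a single contiguous range.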

The code I have allows us to write an MR job that takes a list of prefixes (e.g. the hexchar) and a list of ranges (e.g. the desired timestamps) and constructs a master Scan object that contains any configuration such as filters or cache settings, and a series of Scan objects that constitute the Cartesian product of the prefixes and ranges. Then, it passes those in to a custom getSplits that ensures only the needed regions participate in the Map.
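The idea can be sketched in a few lines of Python. This is not the actual code, only an illustration under stated assumptions: the names `build_scan_ranges`, `overlaps` and `needed_regions` are hypothetical, scans and regions are modeled as plain (start, stop) string pairs rather than HBase Scan objects, and both are treated as half-open ranges where an empty string means unbounded.

```python
from itertools import product


def build_scan_ranges(prefixes, ranges):
    """Cartesian product: one (start_row, stop_row) pair per prefix per range."""
    return [(p + start, p + stop) for p, (start, stop) in product(prefixes, ranges)]


def overlaps(scan, region):
    """True if the half-open scan range intersects the half-open region.

    An empty string is treated as "unbounded" on that side.
    """
    scan_start, scan_stop = scan
    region_start, region_end = region
    starts_before_region_ends = region_end == "" or scan_start < region_end
    stops_after_region_starts = scan_stop == "" or scan_stop > region_start
    return starts_before_region_ends and stops_after_region_starts


def needed_regions(scans, regions):
    """Keep only the regions that at least one scan range touches."""
    return [r for r in regions if any(overlaps(s, r) for s in scans)]


if __name__ == "__main__":
    # Hypothetical inputs: two salt prefixes, one week of timestamps,
    # and a table split into four regions.
    scans = build_scan_ranges(["0", "a"], [("20100901", "20100908")])
    regions = [("", "1"), ("1", "8"), ("8", "b"), ("b", "")]
    for region in needed_regions(scans, regions):
        print(region)
```

In a real job, the pruning step would live inside getSplits, so that splits are only generated for the regions that survive the overlap check.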

If this sounds like it might be useful, I'll work on getting it cleaned up and posted somewhere so you can review it and maybe glean it for ideas. If you are already past that point then I apologize for not checking into this sooner. :)


-Daniel

On 9/11/10 7:09 PM, John Sichi wrote:

Hi Daniel,

I'm almost done with this for HIVE-1226; the remaining step I need to finish is 
to get the filter passed down during getSplits, since the HBase getSplits 
implementation takes care of figuring out which regions contain the row in 
question.

JVS

On Sep 11, 2010, at 7:00 PM, Daniel Einspanjer wrote:

I was trying to spend a little time this weekend catching up with the current 
state of HBase integration for Hive.  One thing that I haven't seen mentioned 
is how exactly Hive scans an HBase table during a SELECT.

Does Hive have logic that allows it to intelligently scan only the 
participating regions during a SELECT query that uses the rowkey?  If not, I 
recently wrote some code that allows a MapReduce job to effectively select the 
regions based on a list of start/end rowkey ranges.  If this might be useful to 
the Hive integration, I could create a Jira and take a look at trying to set up 
a patch.

Daniel Einspanjer
Metrics Architect
Mozilla Corporation
