Nick, many thanks for the pointer. Yeah, the TableInputFormat looks like it fits my needs. I will dig into it. Appreciate the help.
Demai

On Wed, Feb 4, 2015 at 8:13 AM, Nick Dimiduk <ndimi...@gmail.com> wrote:

> Sounds like you're wanting to do a lot of what the TableInputFormat
> facilitates for MapReduce programs. Probably you can use code from that
> package to turn a Scan into input splits, which contain region name
> and RegionServer location, and consume those from your custom coordinator.
>
> -n
>
> On Tuesday, February 3, 2015, Demai Ni <nid...@gmail.com> wrote:
>
> > hi, Guys,
> >
> > I am looking for a way to read an HBase table through MPP (Postgres-XC),
> > and hoping to get some suggestions to either validate or invalidate the
> > approach.
> >
> > Kind of like Apache Drill, but through PostgreSQL. Long story about why
> > Postgres, and how C/C++ will give me a headache for months to come. :-) I
> > will leave it as is for now.
> >
> > The design is to have distributed Postgres-XC installed on the same HBase
> > cluster, so Postgres' datanodes are on the same physical nodes as HBase's
> > RegionServers, and to connect to HBase from PostgreSQL through the
> > existing HBase client code.
> >
> > Step 1: At the Postgres coordinator node (like the Master of HBase), use
> > HTable.getRegionLocations to get all regions of a particular table:
> > NavigableMap<HRegionInfo, ServerName>
> > Step 2: Iterate through the above NavigableMap to map each HBase
> > ServerName to a PG-XC datanode. The goal is to let each Postgres datanode
> > handle the regions on its own physical machine.
> > Step 3: The Postgres coordinator node sends the execution plan to the
> > Postgres datanodes through an existing framework called foreign data
> > wrapper.
> > Step 4: Each Postgres datanode iterates through its assigned regions and
> > opens an HBase Client.Scan() with .setStartRow and .setStopRow so it will
> > only read the assigned region. I was hoping to use HRegionInfo.regionId
> > directly, but can't find such an API in Client.Scan.
> > Step 5: The Postgres datanode further analyzes the retrieved data.
> >
> > So in short, the architectural design is to leverage the Postgres
> > optimizer to parse the SQL query, and use the Postgres datanodes as
> > HBase clients to read HBase regions directly in parallel, with the hope
> > to 1) read each HRegion locally; 2) leverage existing HBase filters.
> >
> > On Step 4 above, is there a way to talk to the RegionServer directly
> > without communicating with the HMaster?
> >
> > Similar ideas (Drill for one, how about HP Vertica?) have been brought up
> > and discussed before. So before I head down the same road, can I pick
> > your brain? Please shed some light, or prevent me from doing something
> > stupid.
> >
> > Many thanks
> >
> > Demai
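
For readers following the thread, here is a minimal sketch of the bookkeeping behind Steps 1, 2 and 4 of the quoted design: group regions by the host that serves them, so each co-located datanode knows which key ranges to scan. It deliberately uses no HBase classes. `RegionAssignment`, `Region`, `assignByHost`, and the host names are hypothetical stand-ins for illustration; in a real setup the (start key, end key, host) triples would presumably come from the region-location map or from TableInputFormat splits, as discussed above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RegionAssignment {

    /** Hypothetical stand-in for one (region, server) pair; NOT an HBase class. */
    public static final class Region {
        public final String startKey; // inclusive; "" means start of table
        public final String endKey;   // exclusive; "" means end of table
        public final String host;     // host of the RegionServer serving the region

        public Region(String startKey, String endKey, String host) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.host = host;
        }
    }

    /**
     * Groups regions by serving host, preserving input order.
     * Assumes each datanode is identified by the same host name as the
     * RegionServer it shares a machine with.
     */
    public static Map<String, List<Region>> assignByHost(List<Region> regions) {
        Map<String, List<Region>> plan = new LinkedHashMap<>();
        for (Region r : regions) {
            plan.computeIfAbsent(r.host, h -> new ArrayList<>()).add(r);
        }
        return plan;
    }

    public static void main(String[] args) {
        // Made-up layout: three regions spread over two hosts.
        List<Region> regions = List.of(
            new Region("", "g", "node1"),
            new Region("g", "p", "node2"),
            new Region("p", "", "node1"));

        // Each datanode would open one scan per local region,
        // bounded by that region's start and end keys.
        assignByHost(regions).forEach((host, local) -> {
            for (Region r : local) {
                System.out.println(host + " scans [\"" + r.startKey
                    + "\", \"" + r.endKey + "\")");
            }
        });
    }
}
```

Grouping by host name is the simplest possible mapping. It says nothing about what to do when a region moves between planning and execution, which a real coordinator would have to handle.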