Thanks for the interest! We are using it in production. It is simple and hence quite stable. Though some minor pieces are missing (like https://github.com/sematext/HBaseWD/issues/1), this doesn't affect stability or major functionality.

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Fri, May 13, 2011 at 10:45 AM, Weishung Chung <[email protected]> wrote:

What's the status on this package? Is it mature enough? I am using it in my
project: I tried out the write method yesterday and am going to incorporate
it into the read method tomorrow.

On Wed, May 11, 2011 at 3:41 PM, Alex Baranau <[email protected]> wrote:

> The start/end rows may be written twice.

Yeah, I know. I meant that the size of the startRow+stopRow data in the
attribute value is "bearable" no matter how long the keys are, since we're
already OK with transferring them initially (i.e. we should be OK with
transferring 2x as much).

So, what about the suggestion for the sourceScan attribute value I
mentioned? If you can tell me why it isn't sufficient in your case, I'd
have more info to think about a better suggestion ;)

> It is okay to keep all versions of your patch in the JIRA.
> Maybe the second should be named HBASE-3811-v2.patch
> <https://issues.apache.org/jira/secure/attachment/12478694/HBASE-3811.patch>?

np. Can do that. Just thought that the patches can be sorted by date to
find the final one (aka "convention over naming-rules").

Alex.

On Wed, May 11, 2011 at 11:13 PM, Ted Yu <[email protected]> wrote:

> Though it might be ok, since we anyways "transfer" start/stop rows with
> the Scan object.

In the write() method, we now have:

    Bytes.writeByteArray(out, this.startRow);
    Bytes.writeByteArray(out, this.stopRow);
    ...
    for (Map.Entry<String, byte[]> attr : this.attributes.entrySet()) {
      WritableUtils.writeString(out, attr.getKey());
      Bytes.writeByteArray(out, attr.getValue());
    }

The start/end rows may be written twice.

Of course, you have full control over how to generate the unique ID for
the "sourceScan" attribute.

It is okay to keep all versions of your patch in the JIRA. Maybe the second
should be named HBASE-3811-v2.patch
<https://issues.apache.org/jira/secure/attachment/12478694/HBASE-3811.patch>?

Thanks

On Wed, May 11, 2011 at 1:01 PM, Alex Baranau <[email protected]> wrote:

> Can you remove the first version ?

Isn't it ok to keep it in the JIRA issue?

> In HBaseWD, can you use reflection to detect whether Scan supports
> setAttribute() ?
> If it does, can you encode start row and end row as "sourceScan"
> attribute ?

Yeah, something like this is going to be implemented. Though I'd still want
to hear from the devs the story about the Scan version.

> One consideration is that start row or end row may be quite long.

Yeah, that was my thought too at first. Though it might be ok, since we
anyways "transfer" start/stop rows with the Scan object.

> What do you think ?

I'd love to hear from you whether the variant I mentioned is what we are
looking at here:

> From what I understand, you want to distinguish scans fired by the same
> distributed scan, i.e. group scans which were fired by a single
> distributed scan. If that's what you want, the distributed scan can
> generate a unique ID and set, say, a "sourceScan" attribute to its value.
> This way we'll have <# of distinct "sourceScan" attribute values> =
> <number of distributed scans invoked by the client side>, and two scans
> on the server side will have the same "sourceScan" attribute iff they
> "belong" to the same distributed scan.

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
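
[For concreteness, a minimal sketch of the tagging idea discussed above. It
assumes a Scan with the setAttribute() support that HBASE-3811 proposes;
the "sourceScan" attribute name is from the discussion, while the
UUID-based ID and the reflection check are only illustrative, not HBaseWD
code.]

    import java.util.UUID;

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SourceScanTagging {

      // Tag every sub-scan of one distributed scan with the same unique ID:
      // two scans arriving at a region server then belong to the same
      // distributed scan iff their "sourceScan" attributes are equal.
      public static void tagSubScans(Scan[] subScans) {
        byte[] sourceScanId = Bytes.toBytes(UUID.randomUUID().toString());
        for (Scan scan : subScans) {
          scan.setAttribute("sourceScan", sourceScanId);
        }
      }

      // The reflection check suggested above: does this version of Scan
      // support setAttribute() at all (i.e. does it include HBASE-3811)?
      public static boolean scanSupportsAttributes() {
        try {
          Scan.class.getMethod("setAttribute", String.class, byte[].class);
          return true;
        } catch (NoSuchMethodException e) {
          return false;
        }
      }
    }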

On Wed, May 11, 2011 at 5:15 PM, Ted Yu <[email protected]> wrote:

Alex:
Your second patch looks good.
Can you remove the first version ?

In HBaseWD, can you use reflection to detect whether Scan supports
setAttribute() ?
If it does, can you encode start row and end row as "sourceScan" attribute ?

One consideration is that start row or end row may be quite long.
Ideally we should store the hash code of the source Scan object as the
"sourceScan" attribute. But Scan doesn't implement hashCode(). We can add
it; that would require running all Scan-related tests.

What do you think ?

Thanks

On Tue, May 10, 2011 at 5:46 AM, Alex Baranau <[email protected]> wrote:

Sorry for the delay in response (public holidays here).

This depends on what info you are looking for on the server side.

From what I understand, you want to distinguish scans fired by the same
distributed scan, i.e. group scans which were fired by a single distributed
scan. If that's what you want, the distributed scan can generate a unique
ID and set, say, a "sourceScan" attribute to its value. This way we'll have
<# of distinct "sourceScan" attribute values> = <number of distributed
scans invoked by the client side>, and two scans on the server side will
have the same "sourceScan" attribute iff they "belong" to the same
distributed scan.

Is this what you are looking for?

Alex Baranau

P.S. attached patch for HBASE-3811
<https://issues.apache.org/jira/browse/HBASE-3811>.
P.S-2. should this conversation be moved to the dev list?

----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Fri, May 6, 2011 at 12:06 AM, Ted Yu <[email protected]> wrote:

Alex:
What type of identification should we put in the map of the Scan object?
I am thinking of using the Id of the RowKeyDistributor. But the user can
use the same distributor on multiple scans.

Please share your thoughts.

On Thu, Apr 21, 2011 at 8:32 AM, Alex Baranau <[email protected]> wrote:

https://issues.apache.org/jira/browse/HBASE-3811

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Thu, Apr 21, 2011 at 5:57 PM, Ted Yu <[email protected]> wrote:

My plan was to make regions that have active scanners more stable - trying
not to move them when balancing.
I prefer the second approach - adding custom attribute(s) to Scan so that
the Scans created by the method below can be 'grouped'.

If you can file a JIRA, that would be great.

On Thu, Apr 21, 2011 at 7:23 AM, Alex Baranau <[email protected]> wrote:

Aha, so you want to "count" it as a single scan (or just differently) when
determining the load?

The current code looks like this:

    public class DistributedScanner {
      public static DistributedScanner create(HTable hTable, Scan original,
          AbstractRowKeyDistributor keyDistributor) throws IOException {
        byte[][] startKeys =
            keyDistributor.getAllDistributedKeys(original.getStartRow());
        byte[][] stopKeys =
            keyDistributor.getAllDistributedKeys(original.getStopRow());
        Scan[] scans = new Scan[startKeys.length];
        for (byte i = 0; i < startKeys.length; i++) {
          scans[i] = new Scan(original);
          scans[i].setStartRow(startKeys[i]);
          scans[i].setStopRow(stopKeys[i]);
        }

        ResultScanner[] rss = new ResultScanner[startKeys.length];
        for (byte i = 0; i < scans.length; i++) {
          rss[i] = hTable.getScanner(scans[i]);
        }

        return new DistributedScanner(rss);
      }
    }

This is client code. To make these scans "identifiable" we need to either
use some different class (derived from Scan) or add some attribute to them.
There's no API for doing the latter. We can do the former, but I don't
really like the idea of creating an extra class (with no extra
functionality) just to distinguish it from the base one.

If you can share why/how you want to treat them differently on the server
side, that would be helpful.

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
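
[A hypothetical sketch of the "different (derived from Scan) class" option
mentioned above - a marker type with no extra functionality, which is
exactly what makes it unattractive:]

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Scan;

    // Marker subclass: adds nothing, exists only so sub-scans created by
    // DistributedScanner could be told apart from plain Scans by type.
    public class DistributedSubScan extends Scan {
      public DistributedSubScan(Scan original) throws IOException {
        super(original); // copy all settings from the original Scan
      }
    }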

On Thu, Apr 21, 2011 at 4:58 PM, Ted Yu <[email protected]> wrote:

My request would be to make the distributed scan identifiable from the
server side. :-)

On Thu, Apr 21, 2011 at 5:45 AM, Alex Baranau <[email protected]> wrote:

> Basically bucketsCount may not equal the number of regions for the
> underlying table.

True: e.g. when there's only one region that holds data for the whole table
(not many records in the table yet), a distributed scan will fire N scans
against the same region. On the other hand, in case there is a huge number
of regions for a single table, each scan can span multiple regions.

> I need to deal with normal scan and "distributed scan" at server side.

With the current implementation a "distributed" scan won't be recognized as
something special on the server side - it will just be ordinary scans.
Though the number of scans will increase, given that the typical situation
is "many regions for a single table", the scans of the same "distributed
scan" are likely not to hit the same region.

Not sure if I answered your questions here. Feel free to ask more ;)

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Wed, Apr 20, 2011 at 2:10 PM, Ted Yu <[email protected]> wrote:

Alex:
If you read this, you would know why I asked:
https://issues.apache.org/jira/browse/HBASE-3679

I need to deal with normal scan and "distributed scan" at the server side.
Basically bucketsCount may not equal the number of regions for the
underlying table.

Cheers

On Tue, Apr 19, 2011 at 11:11 PM, Alex Baranau <[email protected]> wrote:

Hi Ted,

We currently use this tool in a scenario where data is consumed by
MapReduce jobs, so we haven't tested the performance of a pure "distributed
scan" (i.e. N scans instead of 1) a lot. I expect it to be close to simple
scan performance, or maybe sometimes even faster depending on your data
access patterns. E.g. in case you write time-series (sequential) data,
which goes into a single region at a time, and you then access the delta
for further processing/analysis (esp. from more than a single client),
these scans are likely to hit the same region or a couple of regions at a
time, which may perform worse compared to many scans hitting data that is
much better spread over the region servers.

As for a MapReduce job, the approach should not affect reading performance
at all: it's just that there are bucketsCount times more splits and hence
bucketsCount times more Map tasks.
In many cases this even improves the overall performance of the MR job,
since work is better distributed over the cluster (esp. when the aim is to
constantly process the incoming delta, which usually resides in one or just
a couple of regions depending on processing frequency).

If you can share details of your case, that will help to understand what
effect(s) to expect from using this approach.

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Wed, Apr 20, 2011 at 8:17 AM, Ted Yu <[email protected]> wrote:

Interesting project, Alex.
Since there are bucketsCount scanners compared to one scanner originally,
have you performed load testing to see the impact ?

Thanks

On Tue, Apr 19, 2011 at 10:25 AM, Alex Baranau <[email protected]> wrote:

Hello guys,

I'd like to introduce a new small Java project/lib around HBase: HBaseWD.
It is aimed to help with distribution of the load (across region servers)
when writing records with sequential row keys. It implements the solution
which was discussed several times on this mailing list (e.g. here:
http://search-hadoop.com/m/gNRA82No5Wk).

Please find the sources at https://github.com/sematext/HBaseWD (there's
also a jar of the current version for convenience). It is very easy to make
use of it: e.g. I added it to one existing project with 1+2 lines of code
(one where I write to HBase and 2 for configuring the MapReduce job).

Any feedback is highly appreciated!

Please find below the short intro to the lib [1].

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

[1]

Description:
------------
HBaseWD stands for Distributing (sequential) Writes. It was inspired by
discussions on the HBase mailing lists around the problem of choosing
between:
* writing records with sequential row keys (e.g. time-series data with the
  row key built based on ts)
* using random unique IDs for records

The first approach makes it possible to perform fast range scans by setting
start/stop keys on the Scanner, but creates a single-region-server
hot-spotting problem when writing data (as row keys go in sequence, all
records end up written into a single region at a time).

The second approach aims for the fastest writing performance by
distributing new records over random regions, but makes fast range scans
over the written data impossible.

The suggested approach stays in the middle of the two above and has proved
to perform well: it distributes records over the cluster during writing
while still allowing range scans over the data. HBaseWD provides a very
simple API to work with, which makes it easy to use with existing code.

Please refer to the unit tests for lib usage info, as they are meant to
serve as examples.
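
[To make the middle-ground approach concrete, a toy sketch of the one-byte-
prefix idea; this is not HBaseWD's actual code, and the round-robin bucket
choice is just one possibility:]

    // Toy sketch: prepend one of bucketsCount prefix bytes to each
    // sequential key, so consecutive keys land in different key ranges
    // (hence different regions), while a range scan over the original
    // keys can still be answered by bucketsCount parallel scans.
    public class OneBytePrefixSketch {
      private final byte bucketsCount;
      private byte nextBucket = 0;

      public OneBytePrefixSketch(byte bucketsCount) {
        this.bucketsCount = bucketsCount;
      }

      public byte[] getDistributedKey(byte[] originalKey) {
        byte[] key = new byte[originalKey.length + 1];
        key[0] = nextBucket; // bucket prefix
        System.arraycopy(originalKey, 0, key, 1, originalKey.length);
        nextBucket = (byte) ((nextBucket + 1) % bucketsCount); // round-robin
        return key;
      }
    }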

Brief Usage Info (Examples):
----------------------------

Distributing records with sequential keys, writing into up to
Byte.MAX_VALUE buckets:

    byte bucketsCount = (byte) 32; // distributing into 32 buckets
    RowKeyDistributor keyDistributor =
        new RowKeyDistributorByOneBytePrefix(bucketsCount);
    for (int i = 0; i < 100; i++) {
      Put put = new Put(keyDistributor.getDistributedKey(originalKey));
      ... // add values
      hTable.put(put);
    }

Performing a range scan over the written data (internally <bucketsCount>
scanners are executed):

    Scan scan = new Scan(startKey, stopKey);
    ResultScanner rs = DistributedScanner.create(hTable, scan,
        keyDistributor);
    for (Result current : rs) {
      ...
    }

Performing a MapReduce job over the data chunk specified by a Scan:

    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "testMapreduceJob");

    Scan scan = new Scan(startKey, stopKey);

    TableMapReduceUtil.initTableMapperJob("table", scan,
        RowCounterMapper.class, ImmutableBytesWritable.class,
        Result.class, job);

    // Substituting standard TableInputFormat which was set in
    // TableMapReduceUtil.initTableMapperJob(...)
    job.setInputFormatClass(WdTableInputFormat.class);
    keyDistributor.addInfo(job.getConfiguration());
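
[One more snippet as a hypothetical follow-up to the range scan example
above: since the stored keys carry a bucket prefix, code that needs the
original key can map each result's row back via the distributor's
getOriginalKey() - see the abstract class in the next section.]

    Scan scan = new Scan(startKey, stopKey);
    ResultScanner rs = DistributedScanner.create(hTable, scan,
        keyDistributor);
    for (Result current : rs) {
      // recover the original (un-prefixed) row key of this record
      byte[] originalKey = keyDistributor.getOriginalKey(current.getRow());
      ...
    }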

Extending Row Key Distribution Patterns:
----------------------------------------

HBaseWD is designed to be flexible and to support custom row key
distribution approaches. To define custom row key distribution logic, just
extend the AbstractRowKeyDistributor abstract class, which is really very
simple:

    public abstract class AbstractRowKeyDistributor implements
        Parametrizable {
      public abstract byte[] getDistributedKey(byte[] originalKey);
      public abstract byte[] getOriginalKey(byte[] adjustedKey);
      public abstract byte[][] getAllDistributedKeys(byte[] originalKey);
      ... // some utility methods
    }
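
[As a hypothetical illustration - not one of the library's shipped classes -
a sketch of a hash-based distributor. Deriving the bucket from the key
itself makes getDistributedKey() deterministic, so the same original key
always maps to the same stored key; the Parametrizable plumbing is omitted:]

    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch: bucket = hash(originalKey) % maxBuckets, kept as a one-byte
    // prefix. Deterministic, so a single row can be located without
    // scanning all buckets.
    public class HashPrefixRowKeyDistributor extends AbstractRowKeyDistributor {
      private final byte maxBuckets;

      public HashPrefixRowKeyDistributor(byte maxBuckets) {
        this.maxBuckets = maxBuckets;
      }

      @Override
      public byte[] getDistributedKey(byte[] originalKey) {
        byte bucket =
            (byte) ((Bytes.hashCode(originalKey) & 0x7fffffff) % maxBuckets);
        return Bytes.add(new byte[] { bucket }, originalKey);
      }

      @Override
      public byte[] getOriginalKey(byte[] adjustedKey) {
        // strip the one-byte bucket prefix
        return Bytes.tail(adjustedKey, adjustedKey.length - 1);
      }

      @Override
      public byte[][] getAllDistributedKeys(byte[] originalKey) {
        // a range scan must fan out over all possible bucket prefixes
        byte[][] keys = new byte[maxBuckets][];
        for (byte i = 0; i < maxBuckets; i++) {
          keys[i] = Bytes.add(new byte[] { i }, originalKey);
        }
        return keys;
      }

      // Parametrizable / utility methods omitted in this sketch.
    }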
