Yes, it's simple yet useful. I am integrating it. Thanks a lot :)

On Fri, May 13, 2011 at 3:12 PM, Alex Baranau <[email protected]> wrote:
Thanks for the interest!

We are using it in production. It is simple and hence quite stable. Though
some minor pieces are missing (like
https://github.com/sematext/HBaseWD/issues/1), this doesn't affect
stability and/or major functionality.

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Fri, May 13, 2011 at 10:45 AM, Weishung Chung <[email protected]> wrote:

What's the status on this package? Is it mature enough?
I am using it in my project; I tried out the write method yesterday and am
going to incorporate it into the read method tomorrow.

On Wed, May 11, 2011 at 3:41 PM, Alex Baranau <[email protected]> wrote:

> The start/end rows may be written twice.

Yeah, I know. I meant that the size of the startRow+stopRow data is
"bearable" in the attribute value no matter how long they (the keys) are,
since we are already OK with transferring them initially (i.e. we should be
OK with transferring 2x as much).

So, what about the suggestion for the sourceScan attribute value I
mentioned? If you can tell why it isn't sufficient in your case, I'd have
more info to think about a better suggestion ;)

> It is okay to keep all versions of your patch in the JIRA.
> Maybe the second should be named HBASE-3811-v2.patch
> <https://issues.apache.org/jira/secure/attachment/12478694/HBASE-3811.patch>?

np. Can do that. I just thought that they (the patches) can be sorted by
date to find the final one (aka "convention over naming-rules").

Alex.

On Wed, May 11, 2011 at 11:13 PM, Ted Yu <[email protected]> wrote:

> Though it might be ok, since we anyways "transfer" start/stop rows with
> the Scan object.
In the write() method, we now have:

    Bytes.writeByteArray(out, this.startRow);
    Bytes.writeByteArray(out, this.stopRow);
    ...
    for (Map.Entry<String, byte[]> attr : this.attributes.entrySet()) {
      WritableUtils.writeString(out, attr.getKey());
      Bytes.writeByteArray(out, attr.getValue());
    }

The start/end rows may be written twice.

Of course, you have full control over how to generate the unique ID for
the "sourceScan" attribute.

It is okay to keep all versions of your patch in the JIRA. Maybe the second
should be named HBASE-3811-v2.patch
<https://issues.apache.org/jira/secure/attachment/12478694/HBASE-3811.patch>?

Thanks

On Wed, May 11, 2011 at 1:01 PM, Alex Baranau <[email protected]> wrote:

> Can you remove the first version ?

Isn't it ok to keep it in the JIRA issue?

> In HBaseWD, can you use reflection to detect whether Scan supports
> setAttribute() ?
> If it does, can you encode start row and end row as a "sourceScan"
> attribute ?

Yeah, something like this is going to be implemented. Though I'd still want
to hear from the devs the story about the Scan version.

> One consideration is that start row or end row may be quite long.

Yeah, that was my thought too at first. Though it might be ok, since we
anyways "transfer" start/stop rows with the Scan object.

> What do you think ?

I'd love to hear from you whether the variant I mentioned is what we are
looking at here:

> From what I understand, you want to distinguish scans fired by the same
> distributed scan. I.e. group scans which were fired by a single
> distributed scan.
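[Editor's note: Ted's write() excerpt above shows why encoding the rows into
an attribute value doubles their wire cost. A minimal plain-java.io sketch of
that duplication, with no HBase dependency; the length-prefixed write and the
100-byte rows are illustrative stand-ins for Bytes.writeByteArray and real
keys.]

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ScanWireSizeSketch {

    // Length-prefixed byte[] write, mimicking what Bytes.writeByteArray does.
    static void writeByteArray(DataOutputStream out, byte[] b) throws IOException {
        out.writeInt(b.length);
        out.write(b);
    }

    // Serialized size of the start/stop fields, plus optionally a
    // "sourceScan" attribute carrying some extra value bytes.
    public static int serializedSize(byte[] startRow, byte[] stopRow, byte[] attrValue) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            writeByteArray(out, startRow);      // always written as a Scan field
            writeByteArray(out, stopRow);       // always written as a Scan field
            if (attrValue != null) {
                out.writeUTF("sourceScan");     // attribute key
                writeByteArray(out, attrValue); // attribute value
            }
            return buf.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);  // in-memory streams don't really throw
        }
    }

    public static void main(String[] args) {
        byte[] start = new byte[100], stop = new byte[100];
        // Hypothetical attribute value that simply concatenates start+stop rows:
        // the rows' 200 bytes then cross the wire a second time.
        byte[] attr = new byte[200];
        System.out.println(serializedSize(start, stop, null) + " vs "
                + serializedSize(start, stop, attr)); // prints 208 vs 424
    }
}
```

This is why a short fixed-size ID (Ted's hash idea later in the thread) is
attractive when keys are long.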
> If that's what you want, the distributed scan can generate a unique ID
> and set, say, a "sourceScan" attribute to its value. This way we'll have
> <# of distinct "sourceScan" attribute values> = <number of distributed
> scans invoked by the client side>, and two scans on the server side will
> have the same "sourceScan" attribute iff they "belong" to the same
> distributed scan.

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Wed, May 11, 2011 at 5:15 PM, Ted Yu <[email protected]> wrote:

Alex:
Your second patch looks good.
Can you remove the first version ?

In HBaseWD, can you use reflection to detect whether Scan supports
setAttribute() ?
If it does, can you encode start row and end row as a "sourceScan"
attribute ?

One consideration is that start row or end row may be quite long.
Ideally we should store the hash code of the source Scan object as the
"sourceScan" attribute. But Scan doesn't implement hashCode(). We can add
it; that would require running all Scan-related tests.

What do you think ?

Thanks

On Tue, May 10, 2011 at 5:46 AM, Alex Baranau <[email protected]> wrote:

Sorry for the delay in response (public holidays here).

This depends on what info you are looking for on the server side.

From what I understand, you want to distinguish scans fired by the same
distributed scan. I.e. group scans which were fired by a single
distributed scan. If that's what you want, the distributed scan can
generate a unique ID and set, say, a "sourceScan" attribute to its value.
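[Editor's note: Ted's idea of storing a hash of the source Scan instead of
the raw rows could look roughly like the sketch below. `sourceScanId` is a
hypothetical helper computed on the client side, since the Scan of that era
doesn't implement hashCode(); the 31-based mixing mirrors how
java.util.Arrays.hashCode combines elements.]

```java
import java.util.Arrays;

public class SourceScanHash {

    // Hypothetical stand-in for a Scan hashCode(): combine the fields that
    // identify the original (client-side) scan.
    public static int sourceScanId(byte[] startRow, byte[] stopRow) {
        int h = 17;
        h = 31 * h + Arrays.hashCode(startRow);
        h = 31 * h + Arrays.hashCode(stopRow);
        return h;
    }

    public static void main(String[] args) {
        // All N bucket scans derived from one distributed scan would reuse
        // this one id, which stays 4 bytes however long the rows are.
        int id = sourceScanId("user100".getBytes(), "user200".getBytes());
        System.out.println(Integer.toHexString(id));
    }
}
```

A hash can collide, so a randomly generated ID (as discussed above) is the
safer grouping key; the hash only wins on determinism and size.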
This way we'll have <# of distinct "sourceScan" attribute values> =
<number of distributed scans invoked by the client side>, and two scans on
the server side will have the same "sourceScan" attribute iff they
"belong" to the same distributed scan.

Is this what you are looking for?

Alex Baranau

P.S. attached patch for HBASE-3811
<https://issues.apache.org/jira/browse/HBASE-3811>.
P.S-2. should this conversation be moved to the dev list?

----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Fri, May 6, 2011 at 12:06 AM, Ted Yu <[email protected]> wrote:

Alex:
What type of identification should we put in the map of the Scan object ?
I am thinking of using the Id of the RowKeyDistributor. But the user can
use the same distributor on multiple scans.

Please share your thoughts.

On Thu, Apr 21, 2011 at 8:32 AM, Alex Baranau <[email protected]> wrote:

https://issues.apache.org/jira/browse/HBASE-3811

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Thu, Apr 21, 2011 at 5:57 PM, Ted Yu <[email protected]> wrote:

My plan was to make regions that have active scanners more stable -
trying not to move them when balancing.
I prefer the second approach - adding custom attribute(s) to Scan so that
the Scans created by the method below can be 'grouped'.

If you can file a JIRA, that would be great.
On Thu, Apr 21, 2011 at 7:23 AM, Alex Baranau <[email protected]> wrote:

Aha, so you want to "count" it as a single scan (or just differently) when
determining the load?

The current code looks like this:

    class DistributedScanner:

    public static DistributedScanner create(HTable hTable, Scan original,
        AbstractRowKeyDistributor keyDistributor) throws IOException {
      byte[][] startKeys = keyDistributor.getAllDistributedKeys(original.getStartRow());
      byte[][] stopKeys = keyDistributor.getAllDistributedKeys(original.getStopRow());
      Scan[] scans = new Scan[startKeys.length];
      for (byte i = 0; i < startKeys.length; i++) {
        scans[i] = new Scan(original);
        scans[i].setStartRow(startKeys[i]);
        scans[i].setStopRow(stopKeys[i]);
      }

      ResultScanner[] rss = new ResultScanner[startKeys.length];
      for (byte i = 0; i < scans.length; i++) {
        rss[i] = hTable.getScanner(scans[i]);
      }

      return new DistributedScanner(rss);
    }

This is client code. To make these scans "identifiable" we need to either
use some different class (derived from Scan) or add some attribute to
them. There's no API for doing the latter. We can do the former, but I
don't really like the idea of creating an extra class (with no extra
functionality) just to distinguish it from the base one.
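[Editor's note: the attribute approach being discussed can be simulated
without HBase at all. In this sketch, FakeScan is a hypothetical stand-in
for a Scan that supports attributes (which this era's Scan does not, hence
the thread); all N bucket scans fired by one distributed scan share one
"sourceScan" value, so a server could group them back together.]

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

public class SourceScanGrouping {

    // Minimal stand-in for a Scan that supports attributes.
    public static class FakeScan {
        private final Map<String, byte[]> attributes = new HashMap<>();
        public void setAttribute(String name, byte[] value) { attributes.put(name, value); }
        public byte[] getAttribute(String name) { return attributes.get(name); }
    }

    // One distributed scan fires N bucket scans that all carry the same
    // freshly generated sourceScan id.
    public static List<FakeScan> createDistributedScans(int bucketsCount) {
        byte[] id = UUID.randomUUID().toString().getBytes();
        List<FakeScan> scans = new ArrayList<>();
        for (int i = 0; i < bucketsCount; i++) {
            FakeScan s = new FakeScan();
            s.setAttribute("sourceScan", id);
            scans.add(s);
        }
        return scans;
    }
}
```

Two invocations produce two distinct ids, so the server-side invariant from
the thread holds: scans share "sourceScan" iff they belong to the same
distributed scan.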
If you can share why/how you want to treat them differently on the server
side, that would be helpful.

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Thu, Apr 21, 2011 at 4:58 PM, Ted Yu <[email protected]> wrote:

My request would be to make the distributed scan identifiable from the
server side.
:-)

On Thu, Apr 21, 2011 at 5:45 AM, Alex Baranau <[email protected]> wrote:

> Basically bucketsCount may not equal the number of regions for the
> underlying table.

True: e.g. when there's only one region that holds data for the whole
table (not many records in the table yet), a distributed scan will fire N
scans against the same region.
On the other hand, in case there is a huge number of regions for a single
table, each scan can span multiple regions.

> I need to deal with normal scan and "distributed scan" at server side.

With the current implementation a "distributed" scan won't be recognized
as something special on the server side. It will be an ordinary scan.
Though the number of scans will increase, given that the typical situation
is "many regions for a single table", the scans of the same "distributed
scan" are likely not to hit the same region.

Not sure if I answered your questions here. Feel free to ask more ;)

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Wed, Apr 20, 2011 at 2:10 PM, Ted Yu <[email protected]> wrote:

Alex:
If you read this, you would know why I asked:
https://issues.apache.org/jira/browse/HBASE-3679

I need to deal with normal scan and "distributed scan" at server side.
Basically bucketsCount may not equal the number of regions for the
underlying table.

Cheers

On Tue, Apr 19, 2011 at 11:11 PM, Alex Baranau <[email protected]> wrote:

Hi Ted,

We currently use this tool in the scenario where data is consumed by
MapReduce jobs, so we haven't tested the performance of a pure
"distributed scan" (i.e. N scans instead of 1) a lot.
I expect it to be close to simple scan performance, or maybe sometimes
even faster depending on your data access patterns. E.g. in case you write
time-series (sequential) data, which is written into a single region at a
time, then if you access the delta for further processing/analysis (esp.
from more than a single client), these scans are likely to hit the same
region or a couple of regions at a time, which may perform worse compared
to many scans hitting data that is much better spread over the region
servers.

As for a map-reduce job, the approach should not affect reading
performance at all: it's just that there are bucketsCount times more
splits and hence bucketsCount times more Map tasks. In many cases this
even improves overall performance of the MR job, since work is better
distributed over the cluster (esp. in the situation when the aim is to
constantly process the incoming delta, which usually resides in one or
just a couple of regions depending on processing frequency).
If you can share details on your case, that will help to understand what
effect(s) to expect from using this approach.

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Wed, Apr 20, 2011 at 8:17 AM, Ted Yu <[email protected]> wrote:

Interesting project, Alex.
Since there're bucketsCount scanners compared to one scanner originally,
have you performed load testing to see the impact ?

Thanks

On Tue, Apr 19, 2011 at 10:25 AM, Alex Baranau <[email protected]> wrote:

Hello guys,

I'd like to introduce a new small java project/lib around HBase: HBaseWD.
It is aimed to help with distribution of the load (across regionservers)
when writing sequential (because of the row key nature) records. It
implements the solution which was discussed several times on this mailing
list (e.g.
here: http://search-hadoop.com/m/gNRA82No5Wk).

Please find the sources at https://github.com/sematext/HBaseWD (there's
also a jar of the current version for convenience). It is very easy to
make use of it: e.g. I added it to one existing project with 1+2 lines of
code (one where I write to HBase and 2 for configuring the MapReduce job).

Any feedback is highly appreciated!

Please find below the short intro to the lib [1].

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

[1]

Description:
------------
HBaseWD stands for Distributing (sequential) Writes. It was inspired by
discussions on HBase mailing lists around the problem of choosing between:
* writing records with sequential row keys (e.g.
time-series data with a row key built based on ts)
* using random unique IDs for records

The first approach makes it possible to perform fast range scans with the
help of setting start/stop keys on the Scanner, but creates a single
region server hot-spotting problem upon writing data (as row keys go in
sequence, all records end up written into a single region at a time).

The second approach aims for the fastest writing performance by
distributing new records over random regions, but makes it impossible to
do fast range scans against the written data.

The suggested approach stays in the middle of the two above and has proved
to perform well by distributing records over the cluster during data
writing while allowing range scans over it. HBaseWD provides a very simple
API to work with, which makes it perfect to use with existing code.
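[Editor's note: the "middle ground" described above can be demonstrated
without HBase. In this sketch a sorted TreeSet of strings stands in for a
sorted HBase table; keys get a one-byte bucket prefix in the spirit of
RowKeyDistributorByOneBytePrefix, and a "distributed scan" runs one
sub-range per bucket and merges the results. All names and the hash-based
bucket choice are illustrative assumptions.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class BucketedRangeScanSketch {
    static final int BUCKETS = 4; // small count for illustration

    // Distributed key = one-char bucket prefix + original key.
    public static String distribute(String originalKey) {
        int bucket = Math.floorMod(originalKey.hashCode(), BUCKETS);
        return (char) ('0' + bucket) + originalKey;
    }

    // A "distributed scan" over [start, stop): one sub-range per bucket,
    // then strip prefixes and restore the original order.
    public static List<String> distributedScan(TreeSet<String> table, String start, String stop) {
        List<String> originals = new ArrayList<>();
        for (int b = 0; b < BUCKETS; b++) {
            char p = (char) ('0' + b);
            for (String k : table.subSet(p + start, p + stop)) {
                originals.add(k.substring(1));
            }
        }
        originals.sort(null);
        return originals;
    }

    public static void main(String[] args) {
        TreeSet<String> table = new TreeSet<>(); // stands in for a sorted HBase table
        for (int i = 10; i < 30; i++) table.add(distribute("row" + i));
        // Writes spread over buckets, yet the range [row12, row17) is still scannable:
        System.out.println(distributedScan(table, "row12", "row17"));
        // prints [row12, row13, row14, row15, row16]
    }
}
```

Writes land in BUCKETS different key regions, yet any original range is
recoverable with BUCKETS sub-scans, which is exactly the trade-off the
intro describes.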
Please refer to the unit-tests for lib usage info, as they are aimed to
act as examples.

Brief Usage Info (Examples):
----------------------------

Distributing records with sequential keys which are being written in up to
Byte.MAX_VALUE buckets:

    byte bucketsCount = (byte) 32; // distributing into 32 buckets
    RowKeyDistributor keyDistributor =
        new RowKeyDistributorByOneBytePrefix(bucketsCount);
    for (int i = 0; i < 100; i++) {
      Put put = new Put(keyDistributor.getDistributedKey(originalKey));
      ... // add values
      hTable.put(put);
    }

Performing a range scan over written data (internally <bucketsCount>
scanners executed):

    Scan scan = new Scan(startKey, stopKey);
    ResultScanner rs = DistributedScanner.create(hTable, scan,
        keyDistributor);
    for (Result current : rs) {
      ...
    }

Performing a mapreduce job over a written data chunk specified by Scan:

    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "testMapreduceJob");

    Scan scan = new Scan(startKey, stopKey);

    TableMapReduceUtil.initTableMapperJob("table", scan,
        RowCounterMapper.class, ImmutableBytesWritable.class, Result.class,
        job);

    // Substituting standard TableInputFormat which was set in
    // TableMapReduceUtil.initTableMapperJob(...)
    job.setInputFormatClass(WdTableInputFormat.class);
    keyDistributor.addInfo(job.getConfiguration());

Extending Row Keys Distributing Patterns:
-----------------------------------------

HBaseWD is designed to be flexible and to support custom row key
distribution approaches.
To define custom row key distributing logic, just implement the
AbstractRowKeyDistributor abstract class, which is really very simple:

    public abstract class AbstractRowKeyDistributor implements
        Parametrizable {
      public abstract byte[] getDistributedKey(byte[] originalKey);
      public abstract byte[] getOriginalKey(byte[] adjustedKey);
      public abstract byte[][] getAllDistributedKeys(byte[] originalKey);
      ... // some utility methods
    }
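[Editor's note: for a concrete feel of those three methods, here is a
self-contained sketch of a one-byte-prefix distributor. This is NOT the
shipped RowKeyDistributorByOneBytePrefix; deriving the prefix from a hash of
the key is just one deterministic choice, and the class implements the same
method signatures without extending the library's base class.]

```java
import java.util.Arrays;

public class OneBytePrefixDistributorSketch {
    private final byte bucketsCount;

    public OneBytePrefixDistributorSketch(byte bucketsCount) {
        this.bucketsCount = bucketsCount;
    }

    // Prepend a bucket byte derived from the key itself, so the same
    // original key always maps to the same distributed key.
    public byte[] getDistributedKey(byte[] originalKey) {
        byte prefix = (byte) Math.floorMod(Arrays.hashCode(originalKey), bucketsCount);
        byte[] key = new byte[originalKey.length + 1];
        key[0] = prefix;
        System.arraycopy(originalKey, 0, key, 1, originalKey.length);
        return key;
    }

    // Strip the one-byte prefix to recover the original key.
    public byte[] getOriginalKey(byte[] adjustedKey) {
        return Arrays.copyOfRange(adjustedKey, 1, adjustedKey.length);
    }

    // All possible distributed keys for an original key: one per bucket.
    // DistributedScanner uses these to build per-bucket start/stop keys.
    public byte[][] getAllDistributedKeys(byte[] originalKey) {
        byte[][] keys = new byte[bucketsCount][];
        for (byte b = 0; b < bucketsCount; b++) {
            keys[b] = new byte[originalKey.length + 1];
            keys[b][0] = b;
            System.arraycopy(originalKey, 0, keys[b], 1, originalKey.length);
        }
        return keys;
    }
}
```

The invariants any distributor must keep: getOriginalKey is the inverse of
getDistributedKey, and getDistributedKey's result is always among
getAllDistributedKeys.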
