Alex: Can you summarize HBaseWD in your blog, including points 1 and 2 below? Thanks
On Wed, May 18, 2011 at 8:03 AM, Alex Baranau wrote:

There are several options here, e.g.:

1) Given that you have the "original key" of the record, you can fetch the stored record key from HBase and use it to create a Put with updated (or new) cells. Currently you'll need to use a distributed scan for that; there's no analogue of the Get operation yet (see https://github.com/sematext/HBaseWD/issues/1). Note: you need to first find out the real key of the stored record by fetching data from HBase if you use the RowKeyDistributorByOneBytePrefix included in the current lib. Alternatively, see the next option:

2) You can create your own RowKeyDistributor implementation which creates the "distributed key" based on the original key value, so that later, when you have the original key and want to update the record, you can calculate the distributed key without a round trip to HBase. E.g., your RowKeyDistributor implementation can calculate a 1-byte hash of the original key (https://github.com/sematext/HBaseWD/issues/2).

Either way you don't need to delete a record in order to update some of its cells or add new cells.

Please let me know if you have more Qs!

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
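(For illustration, a minimal sketch of option 2. The class here is illustrative and standalone; a real implementation would extend HBaseWD's AbstractRowKeyDistributor, whose abstract methods are quoted at the end of this thread, and would also provide its Parametrizable/utility methods, which are omitted here.)

    import java.util.Arrays;

    // Sketch: the one-byte prefix is a pure function of the original key, so the
    // distributed key can be recomputed at any time without reading the stored key
    // back from HBase before a Put.
    public class HashPrefixDistributorSketch {
      private final byte bucketsCount;

      public HashPrefixDistributorSketch(byte bucketsCount) {
        this.bucketsCount = bucketsCount;
      }

      public byte[] getDistributedKey(byte[] originalKey) {
        byte prefix = (byte) ((Arrays.hashCode(originalKey) & 0x7fffffff) % bucketsCount);
        byte[] key = new byte[originalKey.length + 1];
        key[0] = prefix;
        System.arraycopy(originalKey, 0, key, 1, originalKey.length);
        return key;
      }

      // Strip the one-byte prefix to recover the original key.
      public byte[] getOriginalKey(byte[] adjustedKey) {
        return Arrays.copyOfRange(adjustedKey, 1, adjustedKey.length);
      }

      // Range scans still need one start/stop key variant per possible bucket prefix.
      public byte[][] getAllDistributedKeys(byte[] originalKey) {
        byte[][] keys = new byte[bucketsCount][];
        for (byte i = 0; i < bucketsCount; i++) {
          keys[i] = new byte[originalKey.length + 1];
          keys[i][0] = i;
          System.arraycopy(originalKey, 0, keys[i], 1, originalKey.length);
        }
        return keys;
      }
    }

With such a distributor, an update is simply hTable.put(new Put(distributor.getDistributedKey(originalKey))) with no prior read.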
On Wed, May 18, 2011 at 1:19 AM, Weishung Chung wrote:

I have another question. For overwriting, do I need to delete the existing one before re-writing it?

On Sat, May 14, 2011 at 10:17 AM, Weishung Chung wrote:

Yes, it's simple yet useful. I am integrating it. Thanks a lot :)

On Fri, May 13, 2011 at 3:12 PM, Alex Baranau wrote:

Thanks for the interest!

We are using it in production. It is simple and hence quite stable. Though some minor pieces are missing (like https://github.com/sematext/HBaseWD/issues/1), this doesn't affect stability and/or major functionality.

Alex Baranau

On Fri, May 13, 2011 at 10:45 AM, Weishung Chung wrote:

What's the status on this package? Is it mature enough? I am using it in my project: I tried out the write method yesterday and am going to incorporate it into the read method tomorrow.

On Wed, May 11, 2011 at 3:41 PM, Alex Baranau wrote:

> The start/end rows may be written twice.

Yeah, I know. I meant that the size of the startRow+stopRow data is "bearable" in an attribute value no matter how long the keys are, since we are already OK with transferring them initially (i.e. we should be OK with transferring 2x as much).

So, what about the suggestion of the sourceScan attribute value I mentioned? If you can tell why it isn't sufficient in your case, I'd have more info to think about a better suggestion ;)

> It is okay to keep all versions of your patch in the JIRA. Maybe the second should be named HBASE-3811-v2.patch <https://issues.apache.org/jira/secure/attachment/12478694/HBASE-3811.patch>?

np. Can do that. Just thought that they (patches) can be sorted by date to find out the final one (aka "convention over naming-rules").

Alex.

On Wed, May 11, 2011 at 11:13 PM, Ted Yu wrote:

> Though it might be ok, since we anyways "transfer" start/stop rows with the Scan object.

In the write() method, we now have:

    Bytes.writeByteArray(out, this.startRow);
    Bytes.writeByteArray(out, this.stopRow);
    ...
    for (Map.Entry<String, byte[]> attr : this.attributes.entrySet()) {
      WritableUtils.writeString(out, attr.getKey());
      Bytes.writeByteArray(out, attr.getValue());
    }

The start/end rows may be written twice.

Of course, you have full control over how to generate the unique ID for the "sourceScan" attribute.

It is okay to keep all versions of your patch in the JIRA. Maybe the second should be named HBASE-3811-v2.patch <https://issues.apache.org/jira/secure/attachment/12478694/HBASE-3811.patch>?

Thanks

On Wed, May 11, 2011 at 1:01 PM, Alex Baranau wrote:

> Can you remove the first version?

Isn't it ok to keep it in the JIRA issue?

> In HBaseWD, can you use reflection to detect whether Scan supports setAttribute()? If it does, can you encode start row and end row as a "sourceScan" attribute?

Yeah, something like this is going to be implemented. Though I'd still want to hear from the devs the story about the Scan version.

> One consideration is that the start row or end row may be quite long.

Yeah, that was my thought too at first. Though it might be ok, since we anyways "transfer" start/stop rows with the Scan object.

> What do you think?

I'd love to hear from you whether this variant I mentioned is what we are looking at here:

> From what I understand, you want to distinguish scans fired by the same distributed scan, i.e. group scans which were fired by a single distributed scan. If that's what you want, the distributed scan can generate a unique ID and set, say, a "sourceScan" attribute to its value. This way we'll have <# of distinct "sourceScan" attribute values> = <number of distributed scans invoked by the client side>, and two scans on the server side will have the same "sourceScan" attribute iff they "belong" to the same distributed scan.

Alex Baranau
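(As an aside, the "sourceScan" idea discussed above could look roughly like this on the client side, assuming an HBase version that already has the Scan attribute support proposed in HBASE-3811; the attribute name and helper are illustrative.)

    import java.util.UUID;

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch: tag every per-bucket Scan fired by one distributed scan with the same
    // generated ID, so the server side can group them back together.
    public class SourceScanTagger {
      public static void tag(Scan[] bucketScans) {
        byte[] sourceScanId = Bytes.toBytes(UUID.randomUUID().toString());
        for (Scan scan : bucketScans) {
          scan.setAttribute("sourceScan", sourceScanId); // needs HBASE-3811
        }
      }
    }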
On Wed, May 11, 2011 at 5:15 PM, Ted Yu wrote:

Alex:
Your second patch looks good. Can you remove the first version?

In HBaseWD, can you use reflection to detect whether Scan supports setAttribute()? If it does, can you encode the start row and end row as a "sourceScan" attribute?

One consideration is that the start row or end row may be quite long. Ideally we should store the hash code of the source Scan object as the "sourceScan" attribute, but Scan doesn't implement hashCode(). We can add it; that would require running all Scan-related tests.

What do you think?

Thanks

On Tue, May 10, 2011 at 5:46 AM, Alex Baranau wrote:

Sorry for the delay in response (public holidays here).

This depends on what info you are looking for on the server side.

From what I understand, you want to distinguish scans fired by the same distributed scan, i.e. group scans which were fired by a single distributed scan. If that's what you want, the distributed scan can generate a unique ID and set, say, a "sourceScan" attribute to its value. This way we'll have <# of distinct "sourceScan" attribute values> = <number of distributed scans invoked by the client side>, and two scans on the server side will have the same "sourceScan" attribute iff they "belong" to the same distributed scan.

Is this what you are looking for?

Alex Baranau

P.S. Attached a patch for HBASE-3811 <https://issues.apache.org/jira/browse/HBASE-3811>.
P.S-2. Should this conversation be moved to the dev list?

On Fri, May 6, 2011 at 12:06 AM, Ted Yu wrote:

Alex:
What type of identification should we put in the map of the Scan object? I am thinking of using the Id of the RowKeyDistributor. But the user can use the same distributor on multiple scans.

Please share your thoughts.
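(Ted's reflection suggestion above, i.e. only tag the Scan when the running HBase has setAttribute(), could be sketched like this; the helper name is illustrative.)

    import java.lang.reflect.Method;

    import org.apache.hadoop.hbase.client.Scan;

    // Sketch: detect Scan#setAttribute at runtime so the library keeps working
    // against HBase versions that predate HBASE-3811.
    public class ScanAttributeSupport {
      public static boolean setAttributeIfSupported(Scan scan, String name, byte[] value) {
        try {
          Method m = Scan.class.getMethod("setAttribute", String.class, byte[].class);
          m.invoke(scan, name, value);
          return true;
        } catch (NoSuchMethodException e) {
          return false; // older HBase: no Scan attributes, skip tagging
        } catch (Exception e) {
          throw new RuntimeException("Failed to set Scan attribute", e);
        }
      }
    }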
On Thu, Apr 21, 2011 at 8:32 AM, Alex Baranau wrote:

https://issues.apache.org/jira/browse/HBASE-3811

Alex Baranau

On Thu, Apr 21, 2011 at 5:57 PM, Ted Yu wrote:

My plan was to make regions that have active scanners more stable - trying not to move them when balancing.

I prefer the second approach - adding custom attribute(s) to Scan so that the Scans created by the method below can be "grouped".

If you can file a JIRA, that would be great.

On Thu, Apr 21, 2011 at 7:23 AM, Alex Baranau wrote:

Aha, so you want to "count" it as a single scan (or just differently) when determining the load?

The current code looks like this:

    // in DistributedScanner
    public static DistributedScanner create(HTable hTable, Scan original,
        AbstractRowKeyDistributor keyDistributor) throws IOException {
      byte[][] startKeys = keyDistributor.getAllDistributedKeys(original.getStartRow());
      byte[][] stopKeys = keyDistributor.getAllDistributedKeys(original.getStopRow());
      Scan[] scans = new Scan[startKeys.length];
      for (byte i = 0; i < startKeys.length; i++) {
        scans[i] = new Scan(original);
        scans[i].setStartRow(startKeys[i]);
        scans[i].setStopRow(stopKeys[i]);
      }

      ResultScanner[] rss = new ResultScanner[startKeys.length];
      for (byte i = 0; i < scans.length; i++) {
        rss[i] = hTable.getScanner(scans[i]);
      }

      return new DistributedScanner(rss);
    }

This is client code. To make these scans "identifiable" we need to either use some different (derived from Scan) class or add some attribute to them. There's no API for doing the latter. We could do the former, but I don't really like the idea of creating an extra class (with no extra functionality) just to distinguish it from the base one.

If you can share why/how you want to treat them differently on the server side, that would be helpful.

Alex Baranau
On Thu, Apr 21, 2011 at 4:58 PM, Ted Yu wrote:

My request would be to make the distributed scan identifiable from the server side. :-)

On Thu, Apr 21, 2011 at 5:45 AM, Alex Baranau wrote:

> Basically bucketsCount may not equal the number of regions for the underlying table.

True: e.g. when there's only one region that holds the data for the whole table (not many records in the table yet), a distributed scan will fire N scans against the same region. On the other hand, when there is a huge number of regions for a single table, each scan can span multiple regions.

> I need to deal with normal scan and "distributed scan" at server side.

With the current implementation a "distributed" scan won't be recognized as something special on the server side; it will be an ordinary scan. Though the number of scans will increase, given that the typical situation is "many regions for a single table", the scans of the same "distributed scan" are likely not to hit the same region.

Not sure if I answered your questions here. Feel free to ask more ;)

Alex Baranau
On Wed, Apr 20, 2011 at 2:10 PM, Ted Yu wrote:

Alex:
If you read this, you would know why I asked: https://issues.apache.org/jira/browse/HBASE-3679

I need to deal with normal scan and "distributed scan" at the server side. Basically bucketsCount may not equal the number of regions for the underlying table.

Cheers

On Tue, Apr 19, 2011 at 11:11 PM, Alex Baranau wrote:

Hi Ted,

We currently use this tool in a scenario where the data is consumed by MapReduce jobs, so we haven't tested the performance of a pure "distributed scan" (i.e. N scans instead of 1) a lot. I expect it to be close to simple scan performance, or maybe sometimes even faster, depending on your data access patterns. E.g. if you write time-series (sequential) data, which lands in a single region at a time, and you then access the delta for further processing/analysis (esp. from more than one client), these scans are likely to hit the same region or a couple of regions at a time, which may perform worse compared to many scans hitting data that is much better spread over the region servers.

As for a MapReduce job, the approach should not affect reading performance at all: it's just that there are bucketsCount times more splits and hence bucketsCount times more Map tasks. In many cases this even improves the overall performance of the MR job, since the work is better distributed over the cluster (esp. when the aim is to constantly process the incoming delta, which usually resides in one or just a couple of regions depending on the processing frequency).

If you can share details on your case, that will help to understand what effect(s) to expect from using this approach.

Alex Baranau
On Wed, Apr 20, 2011 at 8:17 AM, Ted Yu wrote:

Interesting project, Alex.
Since there are bucketsCount scanners compared to one scanner originally, have you performed load testing to see the impact?

Thanks

On Tue, Apr 19, 2011 at 10:25 AM, Alex Baranau wrote:

Hello guys,

I'd like to introduce a new small Java project/lib around HBase: HBaseWD. It is aimed to help with distribution of the load (across region servers) when writing sequential (because of the row key nature) records. It implements the solution which was discussed several times on this mailing list (e.g. here: http://search-hadoop.com/m/gNRA82No5Wk).

Please find the sources at https://github.com/sematext/HBaseWD (there's also a jar of the current version for convenience). It is very easy to make use of: e.g. I added it to one existing project with 1+2 lines of code (one where I write to HBase and 2 for configuring the MapReduce job).

Any feedback is highly appreciated!

Please find below the short intro to the lib [1].
Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

[1]

Description:
------------
HBaseWD stands for Distributing (sequential) Writes. It was inspired by discussions on the HBase mailing lists around the problem of choosing between:
* writing records with sequential row keys (e.g. time-series data with a row key built based on a timestamp)
* using random unique IDs for records

The first approach makes it possible to perform fast range scans by setting start/stop keys on the Scanner, but creates a single-region-server hot-spotting problem when writing data (as row keys go in sequence, all records end up written into a single region at a time).

The second approach aims for the fastest write performance by distributing new records over random regions, but makes fast range scans against the written data impossible.

The suggested approach sits in the middle of the two above and has proved to perform well by distributing records over the cluster during writes while still allowing range scans over the data. HBaseWD provides a very simple API, which makes it easy to use with existing code.
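(As a toy illustration of the idea, not the library's actual key layout, though RowKeyDistributorByOneBytePrefix works along these lines: a sequential key gets a small rotating one-byte prefix, so consecutive writes land in different key ranges, while a range scan is simply repeated once per possible prefix.)

    import org.apache.hadoop.hbase.util.Bytes;

    // Toy illustration: sequential (timestamp-based) keys spread over 4 buckets.
    public class BucketingIdea {
      public static void main(String[] args) {
        byte bucketsCount = 4; // illustrative; the lib allows up to Byte.MAX_VALUE buckets
        long ts = System.currentTimeMillis();
        for (int i = 0; i < 8; i++) {
          byte[] originalKey = Bytes.toBytes(ts + i);   // sequential row key
          byte prefix = (byte) (i % bucketsCount);      // rotating bucket prefix
          byte[] distributedKey = Bytes.add(new byte[] {prefix}, originalKey);
          System.out.println(Bytes.toStringBinary(distributedKey));
        }
      }
    }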
Please refer to the unit tests for lib usage info, as they are meant to act as examples.

Brief Usage Info (Examples):
----------------------------

Distributing records with sequential keys which are being written into up to Byte.MAX_VALUE buckets:

    byte bucketsCount = (byte) 32; // distributing into 32 buckets
    RowKeyDistributor keyDistributor =
        new RowKeyDistributorByOneBytePrefix(bucketsCount);
    for (int i = 0; i < 100; i++) {
      Put put = new Put(keyDistributor.getDistributedKey(originalKey));
      ... // add values
      hTable.put(put);
    }

Performing a range scan over the written data (internally <bucketsCount> scanners are executed):

    Scan scan = new Scan(startKey, stopKey);
    ResultScanner rs = DistributedScanner.create(hTable, scan, keyDistributor);
    for (Result current : rs) {
      ...
    }

Performing a MapReduce job over the written data chunk specified by a Scan:

    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "testMapreduceJob");

    Scan scan = new Scan(startKey, stopKey);

    TableMapReduceUtil.initTableMapperJob("table", scan,
        RowCounterMapper.class, ImmutableBytesWritable.class, Result.class, job);

    // Substituting the standard TableInputFormat which was set in
    // TableMapReduceUtil.initTableMapperJob(...)
    job.setInputFormatClass(WdTableInputFormat.class);
    keyDistributor.addInfo(job.getConfiguration());
Extending Row Key Distributing Patterns:
----------------------------------------

HBaseWD is designed to be flexible and to support custom row key distribution approaches. To define custom row key distributing logic, just implement the AbstractRowKeyDistributor abstract class, which is really very simple:

    public abstract class AbstractRowKeyDistributor implements Parametrizable {
      public abstract byte[] getDistributedKey(byte[] originalKey);
      public abstract byte[] getOriginalKey(byte[] adjustedKey);
      public abstract byte[][] getAllDistributedKeys(byte[] originalKey);
      ... // some utility methods
    }
