I have another question. For overwriting, do I need to delete the existing one before re-writing it?
On Sat, May 14, 2011 at 10:17 AM, Weishung Chung <[email protected]> wrote:
> Yes, it's simple yet useful. I am integrating it. Thanks a lot :)

On Fri, May 13, 2011 at 3:12 PM, Alex Baranau <[email protected]> wrote:
> Thanks for the interest!
>
> We are using it in production. It is simple and hence quite stable. Though some minor pieces are missing (like https://github.com/sematext/HBaseWD/issues/1), this doesn't affect stability or major functionality.
>
> Alex Baranau
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Fri, May 13, 2011 at 10:45 AM, Weishung Chung <[email protected]> wrote:
> What's the status of this package? Is it mature enough?
> I am using it in my project: I tried out the write method yesterday and am going to incorporate the read method tomorrow.

On Wed, May 11, 2011 at 3:41 PM, Alex Baranau <[email protected]> wrote:
> > The start/end rows may be written twice.
>
> Yeah, I know. I meant that the size of the startRow+stopRow data is "bearable" in an attribute value no matter how long the keys are, since we are already OK with transferring them initially (i.e. we should be OK with transferring 2x as much).
>
> So, what about the suggestion of the "sourceScan" attribute value I mentioned? If you can tell me why it isn't sufficient in your case, I'd have more info to think about a better suggestion ;)
>
> > It is okay to keep all versions of your patch in the JIRA.
> > Maybe the second should be named HBASE-3811-v2.patch
> > <https://issues.apache.org/jira/secure/attachment/12478694/HBASE-3811.patch>?
>
> np. Can do that. I just thought that the patches can be sorted by date to find the final one (aka "convention over naming rules").
>
> Alex.
On Wed, May 11, 2011 at 11:13 PM, Ted Yu <[email protected]> wrote:
> > Though it might be ok, since we anyways "transfer" start/stop rows with Scan object.
>
> In the write() method, we now have:
>
>     Bytes.writeByteArray(out, this.startRow);
>     Bytes.writeByteArray(out, this.stopRow);
>     ...
>     for (Map.Entry<String, byte[]> attr : this.attributes.entrySet()) {
>       WritableUtils.writeString(out, attr.getKey());
>       Bytes.writeByteArray(out, attr.getValue());
>     }
>
> The start/end rows may be written twice.
>
> Of course, you have full control over how to generate the unique ID for the "sourceScan" attribute.
>
> It is okay to keep all versions of your patch in the JIRA. Maybe the second should be named HBASE-3811-v2.patch
> <https://issues.apache.org/jira/secure/attachment/12478694/HBASE-3811.patch>?
>
> Thanks

On Wed, May 11, 2011 at 1:01 PM, Alex Baranau <[email protected]> wrote:
> > Can you remove the first version ?
>
> Isn't it OK to keep it in the JIRA issue?
>
> > In HBaseWD, can you use reflection to detect whether Scan supports setAttribute() ?
> > If it does, can you encode start row and end row as the "sourceScan" attribute ?
>
> Yeah, something like this is going to be implemented. Though I'd still like to hear from the devs the story about the Scan version.
>
> > One consideration is that start row or end row may be quite long.
>
> Yeah, that was my thought too at first. Though it might be OK, since we "transfer" start/stop rows with the Scan object anyway.
>
> > What do you think ?
> I'd love to hear from you whether this variant I mentioned is what we are looking at here:
>
> > From what I understand, you want to distinguish scans fired by the same distributed scan,
> > i.e. group scans which were fired by a single distributed scan. If that's what you want,
> > the distributed scan can generate a unique ID and set, say, a "sourceScan" attribute to
> > its value. This way we'll have <# of distinct "sourceScan" attribute values> = <number of
> > distributed scans invoked by the client side>, and two scans on the server side will have
> > the same "sourceScan" attribute iff they "belong" to the same distributed scan.
>
> Alex Baranau
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Wed, May 11, 2011 at 5:15 PM, Ted Yu <[email protected]> wrote:
> Alex:
> Your second patch looks good.
> Can you remove the first version ?
>
> In HBaseWD, can you use reflection to detect whether Scan supports setAttribute() ?
> If it does, can you encode start row and end row as the "sourceScan" attribute ?
>
> One consideration is that start row or end row may be quite long.
> Ideally we should store the hash code of the source Scan object as the "sourceScan" attribute. But Scan doesn't implement hashCode(). We can add it; that would require running all Scan-related tests.
>
> What do you think ?
>
> Thanks

On Tue, May 10, 2011 at 5:46 AM, Alex Baranau <[email protected]> wrote:
> Sorry for the delay in response (public holidays here).
> This depends on what info you are looking for on the server side.
>
> From what I understand, you want to distinguish scans fired by the same distributed scan, i.e. group scans which were fired by a single distributed scan. If that's what you want, the distributed scan can generate a unique ID and set, say, a "sourceScan" attribute to its value. This way we'll have <# of distinct "sourceScan" attribute values> = <number of distributed scans invoked by the client side>, and two scans on the server side will have the same "sourceScan" attribute iff they "belong" to the same distributed scan.
>
> Is this what you are looking for?
>
> Alex Baranau
>
> P.S. Attached a patch for HBASE-3811 <https://issues.apache.org/jira/browse/HBASE-3811>.
> P.S.-2. Should this conversation be moved to the dev list?
>
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Fri, May 6, 2011 at 12:06 AM, Ted Yu <[email protected]> wrote:
> Alex:
> What type of identification should we put in the map of the Scan object ?
> I am thinking of using the Id of the RowKeyDistributor. But the user can use the same distributor on multiple scans.
>
> Please share your thoughts.
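[Editor's note: the "sourceScan" grouping idea discussed in this part of the thread can be sketched in plain Java. This is a minimal illustration, not HBaseWD or HBase code: a toy Scan stub stands in for org.apache.hadoop.hbase.client.Scan's setAttribute()/getAttribute(), so the sketch runs without the HBase client on the classpath.]

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

// Sketch of the "sourceScan" grouping idea: each distributed scan generates
// one unique ID and stamps it on every per-bucket Scan it fires, so the
// server side can group scans that belong to the same distributed scan.
public class SourceScanSketch {

    // Minimal stand-in for org.apache.hadoop.hbase.client.Scan's attribute map.
    public static class Scan {
        private final Map<String, byte[]> attributes = new HashMap<>();
        public void setAttribute(String name, byte[] value) { attributes.put(name, value); }
        public byte[] getAttribute(String name) { return attributes.get(name); }
    }

    // Fire one "distributed scan": N per-bucket scans sharing one sourceScan ID.
    public static List<Scan> distributedScan(int buckets) {
        byte[] id = UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8);
        List<Scan> scans = new ArrayList<>();
        for (int i = 0; i < buckets; i++) {
            Scan s = new Scan();
            s.setAttribute("sourceScan", id);
            scans.add(s);
        }
        return scans;
    }

    // Server side: count distinct "sourceScan" values among the observed scans.
    public static int countDistinctSources(List<Scan> serverSideScans) {
        Set<String> ids = new HashSet<>();
        for (Scan s : serverSideScans) {
            ids.add(new String(s.getAttribute("sourceScan"), StandardCharsets.UTF_8));
        }
        return ids.size();
    }

    public static void main(String[] args) {
        List<Scan> all = new ArrayList<>();
        all.addAll(distributedScan(32));  // first distributed scan -> 32 scans
        all.addAll(distributedScan(32));  // second distributed scan -> 32 scans
        // 64 server-side scans, but only 2 distinct "sourceScan" values.
        System.out.println(all.size() + " scans, " + countDistinctSources(all) + " sources");
    }
}
```

This matches the invariant stated in the thread: two scans share the "sourceScan" attribute iff they belong to the same distributed scan.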
On Thu, Apr 21, 2011 at 8:32 AM, Alex Baranau <[email protected]> wrote:
> https://issues.apache.org/jira/browse/HBASE-3811
>
> Alex Baranau
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Thu, Apr 21, 2011 at 5:57 PM, Ted Yu <[email protected]> wrote:
> My plan was to make regions that have active scanners more stable - trying not to move them when balancing.
> I prefer the second approach - adding custom attribute(s) to Scan so that the Scans created by the method below can be 'grouped'.
>
> If you can file a JIRA, that would be great.

On Thu, Apr 21, 2011 at 7:23 AM, Alex Baranau <[email protected]> wrote:
> Aha, so you want to "count" it as a single scan (or just differently) when determining the load?
> The current code looks like this:
>
>     // in class DistributedScanner:
>     public static DistributedScanner create(HTable hTable, Scan original,
>         AbstractRowKeyDistributor keyDistributor) throws IOException {
>       byte[][] startKeys = keyDistributor.getAllDistributedKeys(original.getStartRow());
>       byte[][] stopKeys = keyDistributor.getAllDistributedKeys(original.getStopRow());
>       Scan[] scans = new Scan[startKeys.length];
>       for (byte i = 0; i < startKeys.length; i++) {
>         scans[i] = new Scan(original);
>         scans[i].setStartRow(startKeys[i]);
>         scans[i].setStopRow(stopKeys[i]);
>       }
>
>       ResultScanner[] rss = new ResultScanner[startKeys.length];
>       for (byte i = 0; i < scans.length; i++) {
>         rss[i] = hTable.getScanner(scans[i]);
>       }
>
>       return new DistributedScanner(rss);
>     }
>
> This is client code. To make these scans "identifiable" we'd need to either use some different class (derived from Scan) or add some attribute to them. There's no API for the latter. We could do the former, but I don't really like the idea of creating an extra class (with no extra functionality) just to distinguish it from the base one.
>
> If you can share why/how you want to treat them differently on the server side, that would be helpful.
> Alex Baranau
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Thu, Apr 21, 2011 at 4:58 PM, Ted Yu <[email protected]> wrote:
> My request would be to make the distributed scan identifiable from the server side.
> :-)

On Thu, Apr 21, 2011 at 5:45 AM, Alex Baranau <[email protected]> wrote:
> > Basically bucketsCount may not equal the number of regions for the underlying table.
>
> True: e.g. when there's only one region that holds the data for the whole table (not many records in the table yet), a distributed scan will fire N scans against the same region. On the other hand, when there is a huge number of regions for a single table, each scan can span multiple regions.
>
> > I need to deal with normal scan and "distributed scan" at server side.
>
> With the current implementation a "distributed" scan won't be recognized as something special on the server side. It will be an ordinary scan.
> Though the number of scans will increase, given that the typical situation is "many regions for a single table", the scans of the same "distributed scan" are likely not to hit the same region.
>
> Not sure if I answered your questions here. Feel free to ask more ;)
>
> Alex Baranau
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Wed, Apr 20, 2011 at 2:10 PM, Ted Yu <[email protected]> wrote:
> Alex:
> If you read this, you will know why I asked: https://issues.apache.org/jira/browse/HBASE-3679
>
> I need to deal with normal scan and "distributed scan" at the server side.
> Basically bucketsCount may not equal the number of regions for the underlying table.
>
> Cheers

On Tue, Apr 19, 2011 at 11:11 PM, Alex Baranau <[email protected]> wrote:
> Hi Ted,
>
> We currently use this tool in a scenario where data is consumed by MapReduce jobs, so we haven't tested the performance of a pure "distributed scan" (i.e. N scans instead of 1) a lot. I expect it to be close to simple-scan performance, or maybe sometimes even faster, depending on your data access patterns. E.g. if you write timeseries (sequential) data, which goes into a single region at a time, and you then access the delta for further processing/analysis (esp. from more than one client), these scans are likely to hit the same region or a couple of regions at a time, which may perform worse compared to many scans hitting data that is much better spread over the region servers.
>
> As for a MapReduce job, the approach should not affect reading performance at all: it's just that there are bucketsCount times more splits and hence bucketsCount times more Map tasks. In many cases this even improves overall performance of the MR job, since work is better distributed over the cluster (esp. in situations where the aim is to constantly process the incoming delta, which usually resides in one or just a couple of regions depending on processing frequency).
>
> If you can share details of your case, that will help to understand what effect(s) to expect from using this approach.
>
> Alex Baranau
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Wed, Apr 20, 2011 at 8:17 AM, Ted Yu <[email protected]> wrote:
> Interesting project, Alex.
> Since there are bucketsCount scanners compared to one scanner originally, have you performed load testing to see the impact ?
>
> Thanks

On Tue, Apr 19, 2011 at 10:25 AM, Alex Baranau <[email protected]> wrote:
> Hello guys,
>
> I'd like to introduce a new small Java project/lib around HBase: HBaseWD.
> It is aimed at helping with distribution of the load (across region servers) when writing sequential (because of the row key nature) records. It implements the solution which was discussed several times on this mailing list (e.g. here: http://search-hadoop.com/m/gNRA82No5Wk).
>
> Please find the sources at https://github.com/sematext/HBaseWD (there's also a jar of the current version for convenience). It is very easy to make use of: e.g. I added it to one existing project with 1+2 lines of code (one where I write to HBase and 2 for configuring the MapReduce job).
>
> Any feedback is highly appreciated!
>
> Please find below a short intro to the lib [1].
> Alex Baranau
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
>
> [1]
>
> Description:
> ------------
> HBaseWD stands for Distributing (sequential) Writes. It was inspired by discussions on the HBase mailing lists around the problem of choosing between:
> * writing records with sequential row keys (e.g. time-series data with a row key built based on ts)
> * using random unique IDs for records
>
> The first approach makes it possible to perform fast range scans with the help of setting start/stop keys on the Scanner, but creates a single-region-server hot-spotting problem when writing data (as row keys go in sequence, all records end up written into a single region at a time).
> The second approach aims for the fastest writing performance by distributing new records over random regions, but makes it impossible to do fast range scans against the written data.
>
> The suggested approach sits in the middle of the two above and has proved to perform well by distributing records over the cluster during writes while still allowing range scans over the data. HBaseWD provides a very simple API to work with, which makes it easy to use with existing code.
>
> Please refer to the unit tests for lib usage info, as they are meant to act as examples.
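[Editor's note: the middle-ground approach described above can be sketched self-contained in plain Java. This is an illustrative re-implementation of the one-byte-prefix idea, not HBaseWD's actual source: a bucket byte prepended to each sequential key spreads writes over bucketsCount buckets, and getAllDistributedKeys() expands one logical key into the per-bucket keys a range scan must cover.]

```java
import java.util.Arrays;

// Self-contained sketch of one-byte-prefix key distribution: writes with
// sequential keys are spread over bucketsCount buckets, while a range scan
// expands its (startRow, stopRow) pair into one pair per bucket.
// Illustrative re-implementation, not HBaseWD's actual source.
public class KeySplitSketch {
    private final byte bucketsCount;

    public KeySplitSketch(byte bucketsCount) { this.bucketsCount = bucketsCount; }

    // Prepend the bucket byte to the original key.
    public byte[] getDistributedKey(byte bucket, byte[] originalKey) {
        byte[] key = new byte[originalKey.length + 1];
        key[0] = bucket;
        System.arraycopy(originalKey, 0, key, 1, originalKey.length);
        return key;
    }

    // One distributed key per bucket - these become per-bucket scan boundaries.
    public byte[][] getAllDistributedKeys(byte[] originalKey) {
        byte[][] keys = new byte[bucketsCount][];
        for (byte i = 0; i < bucketsCount; i++) {
            keys[i] = getDistributedKey(i, originalKey);
        }
        return keys;
    }

    public static void main(String[] args) {
        KeySplitSketch d = new KeySplitSketch((byte) 4);
        byte[][] starts = d.getAllDistributedKeys("row-0010".getBytes());
        byte[][] stops  = d.getAllDistributedKeys("row-0020".getBytes());
        // 4 (start, stop) pairs -> 4 scans, one per bucket:
        for (int i = 0; i < starts.length; i++) {
            System.out.println(i + ": " + Arrays.toString(starts[i])
                + " .. " + Arrays.toString(stops[i]));
        }
    }
}
```

Within each bucket, keys keep their original order, which is why range scans remain possible per bucket.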
> Brief Usage Info (Examples):
> ----------------------------
>
> Distributing records with sequential keys which are being written into up to Byte.MAX_VALUE buckets:
>
>     byte bucketsCount = (byte) 32; // distributing into 32 buckets
>     RowKeyDistributor keyDistributor =
>         new RowKeyDistributorByOneBytePrefix(bucketsCount);
>     for (int i = 0; i < 100; i++) {
>       Put put = new Put(keyDistributor.getDistributedKey(originalKey));
>       ... // add values
>       hTable.put(put);
>     }
>
> Performing a range scan over the written data (internally <bucketsCount> scanners are executed):
>
>     Scan scan = new Scan(startKey, stopKey);
>     ResultScanner rs = DistributedScanner.create(hTable, scan, keyDistributor);
>     for (Result current : rs) {
>       ...
>     }
>
> Performing a MapReduce job over a chunk of the written data specified by a Scan:
>
>     Configuration conf = HBaseConfiguration.create();
>     Job job = new Job(conf, "testMapreduceJob");
>
>     Scan scan = new Scan(startKey, stopKey);
>
>     TableMapReduceUtil.initTableMapperJob("table", scan,
>         RowCounterMapper.class, ImmutableBytesWritable.class, Result.class, job);
>
>     // Substituting the standard TableInputFormat which was set in
>     // TableMapReduceUtil.initTableMapperJob(...)
>     job.setInputFormatClass(WdTableInputFormat.class);
>     keyDistributor.addInfo(job.getConfiguration());
>
> Extending Row Keys Distributing Patterns:
> -----------------------------------------
>
> HBaseWD is designed to be flexible and to support custom row key distribution approaches.
> To define custom row key distributing logic, just implement the AbstractRowKeyDistributor abstract class, which is really very simple:
>
>     public abstract class AbstractRowKeyDistributor implements Parametrizable {
>       public abstract byte[] getDistributedKey(byte[] originalKey);
>       public abstract byte[] getOriginalKey(byte[] adjustedKey);
>       public abstract byte[][] getAllDistributedKeys(byte[] originalKey);
>       ... // some utility methods
>     }
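[Editor's note: as an illustration of the extension point above, here is a self-contained sketch of a hash-based custom distributor. It is not part of HBaseWD: the abstract class is stubbed locally so the sketch compiles without the HBaseWD jar, and Parametrizable is omitted for brevity.]

```java
import java.util.Arrays;

// Sketch of a custom distributor: the bucket byte is derived from a hash of
// the original key, so the same key always lands in the same bucket and the
// original key round-trips cleanly. The abstract class is stubbed locally so
// this compiles standalone (the real one also implements Parametrizable).
public class HashPrefixDistributorSketch {

    public static abstract class AbstractRowKeyDistributor {
        public abstract byte[] getDistributedKey(byte[] originalKey);
        public abstract byte[] getOriginalKey(byte[] adjustedKey);
        public abstract byte[][] getAllDistributedKeys(byte[] originalKey);
    }

    public static class HashPrefixDistributor extends AbstractRowKeyDistributor {
        private final byte bucketsCount;

        public HashPrefixDistributor(byte bucketsCount) { this.bucketsCount = bucketsCount; }

        @Override
        public byte[] getDistributedKey(byte[] originalKey) {
            // Non-negative hash -> stable bucket in [0, bucketsCount).
            byte bucket = (byte) ((Arrays.hashCode(originalKey) & 0x7fffffff) % bucketsCount);
            byte[] key = new byte[originalKey.length + 1];
            key[0] = bucket;
            System.arraycopy(originalKey, 0, key, 1, originalKey.length);
            return key;
        }

        @Override
        public byte[] getOriginalKey(byte[] adjustedKey) {
            // Drop the one-byte bucket prefix.
            return Arrays.copyOfRange(adjustedKey, 1, adjustedKey.length);
        }

        @Override
        public byte[][] getAllDistributedKeys(byte[] originalKey) {
            // A scan must cover every bucket, since we can't know which
            // buckets the keys of a given original range hashed into.
            byte[][] keys = new byte[bucketsCount][];
            for (byte i = 0; i < bucketsCount; i++) {
                byte[] key = new byte[originalKey.length + 1];
                key[0] = i;
                System.arraycopy(originalKey, 0, key, 1, originalKey.length);
                keys[i] = key;
            }
            return keys;
        }
    }

    public static void main(String[] args) {
        HashPrefixDistributor d = new HashPrefixDistributor((byte) 16);
        byte[] original = "event-20110419".getBytes();
        byte[] distributed = d.getDistributedKey(original);
        // Round trip: stripping the prefix recovers the original key.
        System.out.println(Arrays.equals(original, d.getOriginalKey(distributed)));
    }
}
```

Such a distributor would plug into DistributedScanner.create() unchanged, since that method only depends on the three abstract methods.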
