I think both approaches should be available to HBase users. Both are new features, and each would find its own usage scenarios.
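As a rough illustration of why each has its place, here is a toy model of my own (not a measurement). It counts how many regions a single-row operation has to contact, assuming one matching row, no caching, and no client-side region pruning. The function name and the two labels are made up for this sketch.

```python
def regions_contacted(index_kind: str, operation: str, total_regions: int) -> int:
    """Toy count of regions one single-row operation has to contact."""
    if index_kind == "global":
        # Separate index table: one index region plus one data region,
        # for reads (index lookup + point-get) and writes (dual write) alike.
        return 2
    if index_kind == "region_level":
        # Index co-located with the data: a write stays in one region, but a
        # read by index value cannot tell which region holds the match.
        return 1 if operation == "write" else total_regions
    raise ValueError(f"unknown index kind: {index_kind}")


# Region counts taken from the RLI numbers quoted further down in this thread.
for total in (200, 400, 800):
    print(total,
          regions_contacted("global", "read", total),
          regions_contacted("region_level", "read", total))
```

Under these assumptions the global index keeps point reads flat as the cluster grows, while the region-level index keeps writes local, which is why I would expect both to be useful.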
Cheers

On Jan 3, 2014, at 5:48 AM, ramkrishna vasudevan <ramkrishna.s.vasude...@gmail.com> wrote:

> What is generally of interest? RLI or global level? I know it depends on the
> use case, but is there a common need?
>
> On Fri, Jan 3, 2014 at 4:31 PM, Anoop John <anoop.hb...@gmail.com> wrote:
>
>> A proportional difference in time taken, wrt increase in # RSs (keeping
>> No# rows matching values constant), would be what is of utmost interest.
>>
>> -Anoop-
>>
>> On Fri, Jan 3, 2014 at 3:49 PM, rajeshbabu chintaguntla
>> <rajeshbabu.chintagun...@huawei.com> wrote:
>>
>>> Here are some performance numbers with RLI.
>>>
>>> No. of region servers: 4
>>> Data per region: 2 GB
>>>
>>> Regions/RS | Total regions | Block size (KB) | No. of rows matching values | Time taken (sec)
>>> 50         | 200           | 64              | 199                         | 102
>>> 50         | 200           | 8               | 199                         | 35
>>> 100        | 400           | 8               | 350                         | 95
>>> 200        | 800           | 8               | 353                         | 153
>>>
>>> Without the secondary index, the scan takes hours.
>>>
>>> Thanks,
>>> Rajeshbabu
>>> ________________________________________
>>> From: Anoop John [anoop.hb...@gmail.com]
>>> Sent: Friday, January 03, 2014 3:22 PM
>>> To: user@hbase.apache.org
>>> Subject: Re: secondary index feature
>>>
>>>> Is there any data on how RLI (or in particular Phoenix) query throughput
>>>> correlates with the number of region servers assuming homogeneously
>>>> distributed data?
>>>
>>> Phoenix has yet to add RLI. For now it has global indexing only. Correct,
>>> James?
>>>
>>> The RLI impl from Huawei (HIndex) has some numbers wrt regions, but I
>>> doubt whether they cover a large no# of RSs. Do you have some data, Rajesh
>>> Babu?
>>>
>>> -Anoop-
>>>
>>> On Fri, Jan 3, 2014 at 3:11 PM, Henning Blohm <henning.bl...@zfabrik.de>
>>> wrote:
>>>
>>>> Jesse, James, Lars,
>>>>
>>>> after looking around a bit and in particular looking into Phoenix (which
>>>> I find very interesting), assuming that you want secondary indexing on
>>>> HBASE without adding other infrastructure, there seems to be not a lot of
>>>> choice really: Either go with a region-level (and co-processor based)
>>>> indexing feature (Phoenix, Huawei, is IHBase dead?) or add an index table
>>>> to store (index value, entity key) pairs.
>>>>
>>>> The main concern I have with region-level indexing (RLI) is that Gets
>>>> potentially have to visit all regions. Compared to global index tables
>>>> this seems to flatten the read-scalability curve of the cluster. In our
>>>> case, we have a large data set (hence HBASE) that will be queried (mostly
>>>> point-gets via an index) in some linear correlation with its size.
>>>>
>>>> Is there any data on how RLI (or in particular Phoenix) query throughput
>>>> correlates with the number of region servers assuming homogeneously
>>>> distributed data?
>>>>
>>>> Thanks,
>>>> Henning
>>>>
>>>> On 24.12.2013 12:18, Henning Blohm wrote:
>>>>
>>>>> All that sounds very promising. I will give it a try and let you know
>>>>> how things worked out.
>>>>>
>>>>> Thanks,
>>>>> Henning
>>>>>
>>>>> On 12/23/2013 08:10 PM, Jesse Yates wrote:
>>>>>
>>>>>> The work that James is referencing grew out of the discussions Lars
>>>>>> and I had (which led to those blog posts). The solution we implemented
>>>>>> is designed to be generic, as James mentioned above, but was written
>>>>>> with all the hooks necessary for Phoenix to do some really fast updates
>>>>>> (or skipping updates in the case where there is no change).
>>>>>>
>>>>>> You should be able to plug in your own simple index builder (there is
>>>>>> an example in the phoenix codebase
>>>>>> <https://github.com/forcedotcom/phoenix/tree/master/src/main/java/com/salesforce/hbase/index/covered/example>)
>>>>>> to get a basic solution which supports the same transactional
>>>>>> guarantees as HBase (per row) + data guarantees across the index rows.
>>>>>> There are more details in the presentations James linked.
>>>>>>
>>>>>> I'd love to see if your implementation can fit into the framework we
>>>>>> wrote - we would be happy to work to see if it needs some more hooks or
>>>>>> modifications - I have a feeling this is pretty much what you guys will
>>>>>> need
>>>>>>
>>>>>> -Jesse
>>>>>>
>>>>>> On Mon, Dec 23, 2013 at 10:01 AM, James Taylor <jtay...@salesforce.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Henning,
>>>>>>> Jesse Yates wrote the back-end of our global secondary indexing system
>>>>>>> in Phoenix. He designed it as a separate, pluggable module with no
>>>>>>> Phoenix dependencies. Here's an overview of the feature:
>>>>>>> https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing.
>>>>>>> The section that discusses the data guarantees and failure management
>>>>>>> might be of interest to you:
>>>>>>> https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing#data-guarantees-and-failure-management
>>>>>>>
>>>>>>> This presentation also gives a good overview of the pluggability of
>>>>>>> his implementation:
>>>>>>> http://files.meetup.com/1350427/PhoenixIndexing-SF-HUG_09-26-13.pptx
>>>>>>>
>>>>>>> Thanks,
>>>>>>> James
>>>>>>>
>>>>>>> On Mon, Dec 23, 2013 at 3:47 AM, Henning Blohm
>>>>>>> <henning.bl...@zfabrik.de> wrote:
>>>>>>>
>>>>>>>> Lars, that is exactly why I am hesitant to use one of the core level
>>>>>>>> generic approaches (apart from having difficulties identifying the
>>>>>>>> still active projects): I have doubts I can sufficiently explain to
>>>>>>>> myself when and where they fail.
>>>>>>>>
>>>>>>>> With "toolbox approach" I meant to say that turning entity data into
>>>>>>>> index data is not done generically but rather involves domain
>>>>>>>> specific application code that
>>>>>>>>
>>>>>>>> - indicates what makes an index key given an entity
>>>>>>>> - indicates whether an index entry is still valid given an entity
>>>>>>>>
>>>>>>>> That code is also used during the index rebuild and trimming (an M/R
>>>>>>>> Job)
>>>>>>>>
>>>>>>>> So validating whether an index entry is valid means to load the
>>>>>>>> entity pointed to and - before considering it a valid result -
>>>>>>>> validating whether values of the entity still match with the index.
>>>>>>>>
>>>>>>>> The entity is written last, hence when the client dies halfway
>>>>>>>> through the update you may get stale index entries but nothing else
>>>>>>>> should break.
>>>>>>>>
>>>>>>>> For scanning along the index, we are using a chunk iterator, that is,
>>>>>>>> we read n index entries ahead and then do point lookups for the
>>>>>>>> entities.
>>>>>>>> How would you avoid point-gets when scanning via an index (as most
>>>>>>>> likely, entities are ordered independently from the index - hence the
>>>>>>>> index)?
>>>>>>>>
>>>>>>>> Something really important to note is that there is no intention to
>>>>>>>> build a completely generic solution, in particular not (this time -
>>>>>>>> unlike the other post of mine you responded to) taking row versioning
>>>>>>>> into account. Instead, row time stamps are used to delete stale
>>>>>>>> entries (old entries after an index rebuild).
>>>>>>>>
>>>>>>>> Thanks a lot for your blog pointers. Haven't had time to study in
>>>>>>>> depth but at first glance there is a lot of overlap of what you are
>>>>>>>> proposing and what I ended up doing considering the first post.
>>>>>>>>
>>>>>>>> On the second post: Indeed I have not worried too much about
>>>>>>>> transactional isolation of updates. If index update and entity update
>>>>>>>> use the same HBase time stamp, the result should at least be
>>>>>>>> consistent, right?
>>>>>>>>
>>>>>>>> Btw. in no way am I claiming originality of my thoughts - in
>>>>>>>> particular I read
>>>>>>>> http://jyates.github.io/2012/07/09/consistent-enough-secondary-indexes.html
>>>>>>>> a while back.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Henning
>>>>>>>>
>>>>>>>> Ps.: I might write about this discussion later in my blog
>>>>>>>>
>>>>>>>> On 22.12.2013 23:37, lars hofhansl wrote:
>>>>>>>>
>>>>>>>>> The devil is often in the details. On the surface it looks simple.
>>>>>>>>>
>>>>>>>>> How specifically are the stale indexes ignored? Are there guaranteed
>>>>>>>>> to be no races?
>>>>>>>>> Is deletion handled correctly? Does it work with multiple versions?
>>>>>>>>> What happens when the client dies halfway through an update?
>>>>>>>>> It's easy to do eventually consistent indexes. Truly consistent
>>>>>>>>> indexes without transactions are tricky.
>>>>>>>>>
>>>>>>>>> Also, scanning an index and then doing point-gets against a main
>>>>>>>>> table is slow (unless the index is very selective. The Phoenix team
>>>>>>>>> measured that there is only an advantage if the index filters out
>>>>>>>>> 98-99% of the data).
>>>>>>>>> So then one would revert to covered indexes and suddenly it is not
>>>>>>>>> so easy to detect stale index entries.
>>>>>>>>>
>>>>>>>>> I blogged about these issues here:
>>>>>>>>> http://hadoop-hbase.blogspot.com/2012/10/musings-on-secondary-indexes.html
>>>>>>>>> http://hadoop-hbase.blogspot.com/2012/10/secondary-indexes-part-ii.html
>>>>>>>>>
>>>>>>>>> Phoenix has a (pretty involved) solution now that works around the
>>>>>>>>> fact that HBase has no transactions.
>>>>>>>>>
>>>>>>>>> -- Lars
>>>>>>>>>
>>>>>>>>> ________________________________
>>>>>>>>> From: Henning Blohm <henning.bl...@zfabrik.de>
>>>>>>>>> To: user <user@hbase.apache.org>
>>>>>>>>> Sent: Sunday, December 22, 2013 2:11 AM
>>>>>>>>> Subject: secondary index feature
>>>>>>>>>
>>>>>>>>> Lately we have added a secondary index feature to a persistence tier
>>>>>>>>> over HBASE. Essentially we implemented what is described as
>>>>>>>>> "Dual-Write Secondary Index" in
>>>>>>>>> http://hbase.apache.org/book/secondary.indexes.html.
>>>>>>>>>
>>>>>>>>> I.e. while updating an entity, actually before writing the actual
>>>>>>>>> update, indexes are updated. Lookup via the index ignores stale
>>>>>>>>> entries. A recurring rebuild and clean out of stale entries takes
>>>>>>>>> care that the indexes are trimmed and accurate.
>>>>>>>>>
>>>>>>>>> None of this was terribly complex to implement. In fact, it seemed
>>>>>>>>> like something you could do generically, maybe not on the HBASE
>>>>>>>>> level itself, but as a toolbox / utility style library.
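
Inline, to make the scheme described above concrete: a minimal sketch of the dual-write idea, with the index written first, the entity written last, and lookups validating each index hit against the entity. This is not Henning's code. `Store`, `put`, `lookup`, `trim` and the `crash_before_entity` flag are hypothetical names, and a plain dict and set stand in for the HBase entity and index tables.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """In-memory stand-in for an entity table plus one index table."""
    entities: dict = field(default_factory=dict)  # entity key -> entity (a dict)
    index: set = field(default_factory=set)       # {(index value, entity key)}

    def put(self, key, entity, index_field, crash_before_entity=False):
        # Step 1: write the index entry first.
        self.index.add((entity[index_field], key))
        if crash_before_entity:
            return  # simulates the client dying halfway through the update
        # Step 2: write the entity last.
        self.entities[key] = entity

    def lookup(self, value, index_field):
        """Return (key, entity) pairs whose index_field currently equals value."""
        hits = []
        for index_value, key in sorted(self.index):
            if index_value != value:
                continue
            entity = self.entities.get(key)  # point-get on the entity
            # Validate: skip entries whose entity is missing or has changed.
            if entity is not None and entity.get(index_field) == value:
                hits.append((key, entity))
        return hits

    def trim(self, index_field):
        """Drop stale entries; stands in for the recurring rebuild job."""
        self.index = {
            (v, k) for v, k in self.index
            if k in self.entities and self.entities[k].get(index_field) == v
        }


store = Store()
store.put("e1", {"city": "Berlin"}, "city")
store.put("e1", {"city": "Paris"}, "city")  # leaves a stale ("Berlin", "e1")
store.put("e2", {"city": "Berlin"}, "city", crash_before_entity=True)

print(store.lookup("Berlin", "city"))  # stale and orphaned entries are skipped
print(store.lookup("Paris", "city"))
store.trim("city")
print(sorted(store.index))
```

The sketch ignores concurrency, deletes and row versions, which are exactly the points raised earlier in the thread, so it only shows the happy path plus the crash-before-entity case.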
>>>>>>>>>
>>>>>>>>> Is anybody on the list aware of anything useful already existing in
>>>>>>>>> that space?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Henning Blohm
>>>>>>>>>
>>>>>>>>> *ZFabrik Software KG*
>>>>>>>>>
>>>>>>>>> T: +49 6227 3984255
>>>>>>>>> F: +49 6227 3984254
>>>>>>>>> M: +49 1781891820
>>>>>>>>>
>>>>>>>>> Lammstrasse 2 69190 Walldorf
>>>>>>>>>
>>>>>>>>> henning.bl...@zfabrik.de
>>>>>>>>> Linkedin <http://www.linkedin.com/pub/henning-blohm/0/7b5/628>
>>>>>>>>> ZFabrik <http://www.zfabrik.de>
>>>>>>>>> Blog <http://www.z2-environment.net/blog>
>>>>>>>>> Z2-Environment <http://www.z2-environment.eu>
>>>>>>>>> Z2 Wiki <http://redmine.z2-environment.net>