I think we should revive the secondary indexes discussion (actually, it has already been revived).
Since Ramkrishna has a design in mind, he would be the best person to log a new JIRA.

Cheers

On Tue, Aug 28, 2012 at 10:03 AM, Jesse Yates <jesse.k.ya...@gmail.com> wrote:

> @Ted: Are you proposing re-opening the "should we have secondary indexes in
> HBase" discussion? If so, I'm +1 on adding them. Wanna file a jira?
>
> @Wei Tan: Yeah, I generally agree. However, I think you can get away with
> ignoring MVCC and just keep an index on the latest key (where key
> _includes_ the timestamp) and then do lazy cleanup.
>
> @Ram: if you move the TS into the CQ you can remove the actual TS (so it
> costs you some minor computational overhead to pull it out), still giving
> you the right answer without actually using HBase timestamps.
>
> I've proposed that you can just do an async cleanup of the index when you
> find out it's stale, with minimal overhead to the clients. Otherwise, yes,
> you would need a way to tie together the versions in the index and primary
> tables, which you don't always want to keep exactly the same.
>
> Also, there is an issue when returning the version of the row based on the
> indexed TS. Should you return the whole row? Should you return just the
> parts of the row with timestamps the same age or older? For the latter, how
> do you know which parts of the row to return when you have two versions of
> the same column that was indexed (which other row elements should be
> included based on TS)? I'd propose these are all questions that need to be
> answered if we are going to do a general hbase index.
> -------------------
> Jesse Yates
> @jesse_yates
> jyates.github.com
>
> On Tue, Aug 28, 2012 at 9:03 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> > I think this discussion should be on HBASE JIRA.
> >
> > Another dimension to secondary indexing is the co-location (or pairing) of
> > data table region and index table region. Related regions from the two
> > tables should be placed on the same region server.
> >
> > Cheers
> >
> > On Tue, Aug 28, 2012 at 8:52 AM, Wei Tan <w...@us.ibm.com> wrote:
> >
> > > Thanks for sharing a pointer to your implementation.
> > > My two cents:
> > > A timestamp is a way to do MVCC, and setting every KV with the same TS
> > > will make concurrency control very tricky and error prone, if not
> > > impossible.
> > > I think Ram is talking about the dead entries in the index table rather
> > > than the data table. Deleting old index entries upfront when there is a
> > > new put might be a choice.
> > >
> > > Best Regards,
> > > Wei
> > >
> > > Wei Tan
> > > Research Staff Member
> > > IBM T. J. Watson Research Center
> > > 19 Skyline Dr, Hawthorne, NY 10532
> > > w...@us.ibm.com; 914-784-6752
> > >
> > > From: Jesse Yates <jesse.k.ya...@gmail.com>
> > > To: dev@hbase.apache.org,
> > > Date: 08/28/2012 04:00 AM
> > > Subject: Re: A general question on maxVersion handling when we have
> > > Secondary index tables
> > >
> > > Ram,
> > >
> > > If I understand correctly, I think you can design your index such that
> > > you don't actually use the timestamp (e.g. everything gets put with a
> > > TS = 10 - or some other non-special, relatively small number that's not 0
> > > as I'd worry about that in HBase ;) Then when you set maxVersions to 1,
> > > everything should be good.
> > >
> > > You get a couple of wasted bytes from the TS, but with the prefixTrie
> > > stuff that should be pretty minimal overhead. If you do need to keep
> > > track of the timestamp you should be able to munge that back up into the
> > > column qualifier (and just know that the last 64 bits is the timestamp).
> > > Again, a little more CPU cost, but it's really not that big of an
> > > overhead. It seems like you don't really care about the TS though, in
> > > which case this should be pretty simple.
> > >
> > > Out of curiosity, what are people using for their secondary indexing
> > > solutions?
> > > I know there are a bunch out there, but I don't know what people
> > > have adopted, what they like/dislike, design tradeoffs made and why.
> > >
> > > Disclaimer: I recently proposed a secondary indexing solution myself
> > > (shameless self-plug:
> > > http://jyates.github.com/2012/07/09/consistent-enough-secondary-indexes.html
> > > ) and it's something I'm working on for Salesforce - open sourced at some
> > > point, promise!
> > >
> > > -Jesse
> > > -------------------
> > > Jesse Yates
> > > @jesse_yates
> > > jyates.github.com
> > >
> > > On Tue, Aug 28, 2012 at 12:24 AM, Ramkrishna.S.Vasudevan <
> > > ramkrishna.vasude...@huawei.com> wrote:
> > >
> > > > Hi All
> > > >
> > > > When we try to build any type of secondary indices for a given table,
> > > > how can one handle maxVersions in the secondary index tables?
> > > >
> > > > For example, I have inserted
> > > >
> > > > Row1 - Val1 => t
> > > > Row1 - Val2 => t+1
> > > > Row1 - Val3 => t+2
> > > >
> > > > Ideally, if my max versions is only one, then Val3 should be my result
> > > > if I query the main table for Row1.
> > > >
> > > > Now in my index I will have all of the above 3 entries. How can we
> > > > remove the older entries from the index table that do not fit into
> > > > maxVersions?
> > > >
> > > > Currently, the scan code that enforces maxVersions does not give any
> > > > hooks to know which entries were skipped because of versions.
> > > >
> > > > So any suggestions on this? I am still looking through the code for
> > > > other options, but suggestions are welcome.
> > > >
> > > > Regards
> > > > Ram
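
For reference, here is a rough sketch of two ideas from the thread: folding the timestamp into the last 64 bits of the column qualifier, and dropping stale index entries lazily when a lookup discovers them. It is plain Python over in-memory dicts, not the HBase client API. The names encode_qualifier, decode_qualifier and lookup, and the dict layout of the index and primary tables, are all hypothetical. The cleanup below also runs inline on read for simplicity, whereas the proposal in the thread is to do it asynchronously.

```python
import struct
from typing import Dict, List, Set, Tuple

TS_LEN = 8  # the timestamp occupies the last 64 bits of the encoded qualifier


def encode_qualifier(qualifier: bytes, ts: int) -> bytes:
    """Append the timestamp to the qualifier as 8 big-endian bytes."""
    return qualifier + struct.pack(">Q", ts)


def decode_qualifier(encoded: bytes) -> Tuple[bytes, int]:
    """Split an encoded qualifier back into (qualifier, timestamp)."""
    if len(encoded) < TS_LEN:
        raise ValueError("encoded qualifier is shorter than 8 bytes")
    (ts,) = struct.unpack(">Q", encoded[-TS_LEN:])
    return encoded[:-TS_LEN], ts


def lookup(
    index: Dict[bytes, Set[bytes]],
    primary: Dict[bytes, Tuple[bytes, int]],
    value: bytes,
) -> List[bytes]:
    """Return rows whose latest primary value is `value`.

    Index entries that no longer match the primary table are removed
    as they are found (lazy cleanup; done inline here as a simplification).
    """
    live = []
    # sorted() copies the entries, so discarding from the set below is safe.
    for entry in sorted(index.get(value, set())):
        row, ts = decode_qualifier(entry)
        if primary.get(row) == (value, ts):
            live.append(row)
        else:
            index[value].discard(entry)
    return live


# Ram's example: three puts to Row1, with only the latest kept in the
# primary table (maxVersions = 1) but all three present in the index.
t = 100
primary = {b"Row1": (b"Val3", t + 2)}
index = {
    b"Val1": {encode_qualifier(b"Row1", t)},
    b"Val2": {encode_qualifier(b"Row1", t + 1)},
    b"Val3": {encode_qualifier(b"Row1", t + 2)},
}
stale_hits = lookup(index, primary, b"Val1")
fresh_hits = lookup(index, primary, b"Val3")
```

In this sketch a lookup on Val1 finds the index entry, sees that the primary table no longer holds that value at that timestamp, and discards the entry, so readers pay the cleanup cost only for entries they actually touch.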