Hi Mike, I agree with you - the way you've outlined it is exactly the way Phoenix has implemented it. It's a bit of a terminology problem, though. We call it salting: http://phoenix.incubator.apache.org/salted.html. We hash the key, mod the hash by the SALT_BUCKETS value you provide, and prepend the row key with this single byte value. Maybe you can coin a good term for this technique?
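For anyone who wants to see the mechanics spelled out, here's a minimal Python sketch of the scheme described above. It is not Phoenix's actual code: MD5 is just a stand-in for whatever hash function is used, and the bucket count and key encoding are assumptions.

```python
import hashlib

SALT_BUCKETS = 8  # assumed bucket count; in Phoenix this comes from the table definition

def salt_byte(row_key: bytes) -> int:
    # Hash the row key and mod by the bucket count to get a single,
    # deterministic prefix byte in the range 0 .. SALT_BUCKETS-1.
    # MD5 is a stand-in hash here, not necessarily what Phoenix uses.
    digest = hashlib.md5(row_key).digest()
    return int.from_bytes(digest, "big") % SALT_BUCKETS

def salted_key(row_key: bytes) -> bytes:
    # Prepend the salt byte so writes spread across SALT_BUCKETS key ranges,
    # while a point lookup can still recompute the prefix from the key alone.
    return bytes([salt_byte(row_key)]) + row_key

# The same key always yields the same salted key, so get() still works:
assert salted_key(b"20140501|user42") == salted_key(b"20140501|user42")
```

The point being: because the prefix is a pure function of the key, this is really Mike's "truncated hash" in disguise, just prepended rather than replacing the key.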
FWIW, you don't lose the ability to do a range scan when you salt (or hash the key and mod by the number of "buckets"), but you do need to run a scan for each possible value of your salt byte (0 to SALT_BUCKETS-1). The client then does a merge sort among these scans. It performs well.

Thanks,
James

On Fri, May 9, 2014 at 11:57 PM, Michael Segel <michael_se...@hotmail.com> wrote:

> 3+ years on and a bad idea is being propagated again.
>
> Now repeat after me… DO NOT USE A SALT.
>
> Having a low-sodium diet, especially for HBase, is really good for your
> health and sanity.
>
> The salt is going to be orthogonal to the row key (Key).
> There is no relationship to the specific Key.
>
> Using a salt means you gain the ability to randomly spread the
> distribution of data to avoid HOT SPOTTING.
> However, you lose the ability to seek for a specific row.
>
> YOU HASH THE KEY.
>
> The hash, whether you use SHA-1 or MD5, is going to yield the same result
> each and every time you provide the key.
>
> But wait, the generated hash is 160 bits long. We don't need that!
> Absolutely true, if you just want to randomize the key to avoid hot
> spotting. There's this concept called truncating the hash to the desired
> length.
> So to Adrien's point, you can truncate it to a single byte, which would be
> sufficient…
> Now when you want to seek a specific row, you can find it.
>
> The downside to either solution is that you lose the ability to do a range
> scan.
> BUT BY USING A HASH AND NOT A SALT, YOU DON'T LOSE THE ABILITY TO FETCH A
> SINGLE ROW VIA A get() CALL.
>
> <rant>
> This simple fact was pointed out several years ago, yet for some
> reason the use of a salt persists.
> I've actually made this part of the HBase course I wrote, and I use it in
> my presentation(s) on HBase.
>
> It amazes me that the committers and regulars who post here still don't
> grok the fact that if you're going to 'SALT' a row, you might as well not
> use HBase and stick with Hive.
> I remember Ed C's rant about how preferential treatment on Hive patches
> was given to vendors' committers… that preferential treatment seems to
> also be extended to speakers at conferences. It wouldn't be a problem if
> said speakers actually knew the topic… ;-)
>
> Propagation of bad ideas means that you're leaving a lot of performance on
> the table, and it can kill or cripple projects.
>
> </rant>
>
> Sorry for the rant…
>
> -Mike
>
>
> On May 3, 2014, at 4:39 PM, Software Dev <static.void....@gmail.com> wrote:
>
> > OK, so there is no way around the FuzzyRowFilter checking every single
> > row in the table, correct? If so, what is a valid use case for that
> > filter?
> >
> > OK, so salt to a low enough prefix that makes scanning reasonable. Our
> > client for accessing these tables is a Rails (not JRuby) application,
> > so we are stuck with either the Thrift or REST client. Can either of
> > these perform multiple gets/scans?
> >
> > On Sat, May 3, 2014 at 1:10 AM, Adrien Mogenet <adrien.moge...@gmail.com> wrote:
> >
> >> Using 4 random bytes you'll get 2^32 possibilities; thus your data can
> >> be split enough among all the possible regions, but you won't be able
> >> to easily benefit from distributed scans to gather what you want.
> >>
> >> Let's say you want to split (time+login) with a salted key and you
> >> expect to be able to retrieve events from 20140429 pretty fast. Then I
> >> would split input data among 10 "spans", spread over 10 regions and 10
> >> RSs (i.e. `$random % 10`). To retrieve ordered data, I would
> >> parallelize Scans over the 10 span groups (<00>-20140429,
> >> <01>-20140429, …) and merge-sort everything until I've got all the
> >> expected results.
> >>
> >> So in terms of performance this looks "a little bit" faster than your
> >> 2^32 randomization.
> >> On Fri, May 2, 2014 at 10:09 PM, Software Dev <static.void....@gmail.com> wrote:
> >>
> >>> I'm planning to work with FuzzyRowFilter to avoid hot spotting of our
> >>> time series data (20140501, 20140502, …). We can prefix all of the
> >>> keys with 4 random bytes and then just skip these during scanning. Is
> >>> that correct? This *seems* like it will work, but I'm questioning the
> >>> performance of this even if it does work.
> >>>
> >>> Also, is this available via the REST client, shell, and/or Thrift
> >>> client?
> >>>
> >>> Also, is there a FuzzyColumn equivalent of this feature?
> >>
> >> --
> >> Adrien Mogenet
> >> http://www.borntosegfault.com
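To make the read side of the thread concrete, here's a rough Python sketch of the fan-out-and-merge approach James and Adrien describe: one scan per possible salt byte, merge-sorted client-side on the unsalted key. The toy `table` list and `scan_bucket` function are stand-ins for HBase regions and per-region Scans, not real HBase API calls, and the salting scheme is the same hash-mod-prepend sketch as earlier in the thread.

```python
import hashlib
import heapq

SALT_BUCKETS = 8  # assumed; must match the value used on the write path

def salted_key(row_key: bytes) -> bytes:
    # Write-path salting: hash (MD5 as a stand-in), mod, prepend one byte.
    digest = hashlib.md5(row_key).digest()
    return bytes([int.from_bytes(digest, "big") % SALT_BUCKETS]) + row_key

# Toy "table": a sorted list of salted keys standing in for HBase regions.
table = sorted(salted_key(b"201405%02d" % d) for d in range(1, 10))

def scan_bucket(bucket: int, start: bytes, stop: bytes):
    # Hypothetical stand-in for one HBase Scan over [salt+start, salt+stop).
    lo = bytes([bucket]) + start
    hi = bytes([bucket]) + stop
    for k in table:
        if lo <= k < hi:
            yield k[1:]  # strip the salt byte before handing rows back

def range_scan(start: bytes, stop: bytes):
    # One scan per possible salt value, merge-sorted on the unsalted key.
    # Each per-bucket scan is already sorted, so the merge restores the
    # global ordering the salt byte destroyed.
    scans = [scan_bucket(b, start, stop) for b in range(SALT_BUCKETS)]
    return list(heapq.merge(*scans))

rows = range_scan(b"20140503", b"20140507")
# rows come back in plain key order despite being spread across buckets
```

The cost is SALT_BUCKETS scans instead of one, which is why a small bucket count (Adrien's 10 spans, or a single salt byte) keeps this reasonable.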