Hi,

You have to consider the difference between a bulk one-time import and a continuous row-insertion process. The former often needs to achieve extremely high insert rates (150k ops/sec or more) to import a multi-hundred-million-row data set in any reasonable time frame. The latter tends to be much slower: unless you are planning on adding users faster than 20,000 a second, you probably don't need to hash userIds.
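If you do decide to hash, the little-endian idea from your question is probably enough, since it stays invertible: you re-encode the real userId on lookup, so random access still works. A rough sketch in plain Java (class and method names are just for illustration, not anything in HBase itself):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class UserIdKeys {

    // Write the 8-byte userId little-endian, so the low-order byte (which
    // changes fastest for sequential ids) becomes the leading key byte and
    // inserts spread across regions instead of piling onto one.
    public static byte[] toRowKey(long userId) {
        return ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(userId)
                .array();
    }

    // The transform is invertible, so a lookup just applies the same
    // encoding to the real userId (or decodes a key seen in a scan).
    public static long fromRowKey(byte[] rowKey) {
        return ByteBuffer.wrap(rowKey)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getLong();
    }
}

Note this only spreads writes if the low-order bytes of your ids actually vary (sequential ids are fine); a stronger scramble would spread better, but whatever you pick has to be recomputable from the userId alone or you lose random access.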
It should be possible to randomly insert data from a pre-existing data set. There is some work on importing directly into HFiles, skipping the regionserver, but that would only really work for one-time imports into new tables.

On Mon, Jul 27, 2009 at 12:01 PM, Fernando Padilla<[email protected]> wrote:
> So I will be generating lots of rows into the db keyed by userId, in userId
> order.
>
> I have already learned through this mailing list that this use-case is not
> ideal, since it would mean most row-inserts will be on one region server. I
> know that some people suggest to add some randomization to the keys so that
> it will be spread around, but I can't do that, since I'll need to be able to
> do random access lookup on the rows via userId.
>
>
> But I'm wondering if I could map/hash the real userId, into another number
> that will spread around the inserts. And I can still do random access
> lookups given a real userId, by calculating the hash..
>
>
>
> 1) i think i like this idea, does anyone have any experience with this?
>
> 2) assume userId is an 8-byte long, what would be some good hashing functions?
> I would be lazy and use little-endian, but I bet one of you could come up
> with something better. :)
>
>
