Hi,

You have to consider the difference between a bulk one-time import and
a continuous row-insertion process.  The former often needs to achieve
extremely high insert rates (150k ops/sec and up) to import a
multi-hundred-million-row data set in any reasonable time frame.  The
latter tends to be fairly slow by comparison; unless you are planning
on adding users faster than 20,000 a second, you probably don't need
to hash userIds.

It should be possible to insert data from a pre-existing data set in
random order.  There is some work underway to import straight into
HFiles, skipping the regionserver entirely, but that would only really
work for one-time imports into new tables.
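
If you do go down the hashing route, the only real requirement is that the
mapping is deterministic, so you can recompute the same row key at lookup
time.  A minimal sketch in plain Java (the hashedRowKey helper and the
4-byte MD5 prefix are just illustrative choices, not anything HBase-specific):
salt the key with a few bytes of a hash of the userId and keep the raw
userId in the tail, so the original id is still recoverable from the key.

  import java.nio.ByteBuffer;
  import java.security.MessageDigest;

  public class HashedKeys {

    // Row key layout: 4-byte MD5 prefix of the userId, then the 8-byte userId.
    // The prefix destroys the insertion-order locality so writes spread across
    // regions; the tail keeps the original id readable from the key itself.
    public static byte[] hashedRowKey(long userId) throws Exception {
      byte[] idBytes = ByteBuffer.allocate(8).putLong(userId).array();
      byte[] digest = MessageDigest.getInstance("MD5").digest(idBytes);
      return ByteBuffer.allocate(12)
          .put(digest, 0, 4)   // salt prefix
          .put(idBytes)        // original userId
          .array();
    }

    // Lookup side: given a real userId, recompute exactly the same key.
    public static byte[] lookupKey(long userId) throws Exception {
      return hashedRowKey(userId);
    }
  }

Anything deterministic would do the same job (the byte-swap you mention is
the lazy version); the point is just that consecutive userIds no longer sort
next to each other.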


On Mon, Jul 27, 2009 at 12:01 PM, Fernando Padilla<[email protected]> wrote:
> So I will be generating lots of rows into the db keyed by userId, in userId
> order.
>
> I have already learned through this mailing list that this use case is not
> ideal, since it would mean most row inserts land on one region server.  I
> know some people suggest adding some randomization to the keys so that the
> load is spread around, but I can't do that, since I'll need to be able to
> do random-access lookups on the rows via userId.
>
>
> But I'm wondering if I could map/hash the real userId into another number
> that will spread the inserts around.  And I could still do random-access
> lookups given a real userId, by calculating the hash.
>
>
>
> 1) I think I like this idea. Does anyone have any experience with it?
>
> 2) Assuming userId is an 8-byte long, what would be some good hashing
> functions?  I would be lazy and just use little-endian byte order, but I bet
> one of you could come up with something better. :)
>
>
