Not to take the thread off topic, but do you have any links to information about importing directly into hfiles?
Thanks,
Mike

On Mon, Jul 27, 2009 at 3:08 PM, Ryan Rawson <[email protected]> wrote:
> Hi,
>
> You have to consider the difference between a bulk one-time import and
> a continuous row-insertion process. Often the former needs to achieve
> extremely high insert rates (150k ops/sec or more) to import a large
> multi-hundred-million-row data set in any reasonable time frame. But the
> latter tends to be fairly slow; unless you are planning on adding users
> faster than 20,000 a second, you probably don't need to hash userids.
>
> It should be possible to randomly insert data from a pre-existing data
> set. There is some work on importing straight into hfiles and skipping
> the regionserver, but that would only really work for one-time imports
> into new tables.
>
>
> On Mon, Jul 27, 2009 at 12:01 PM, Fernando Padilla <[email protected]> wrote:
> > So I will be generating lots of rows into the db keyed by userId, in
> > userId order.
> >
> > I have already learned through this mailing list that this use case is
> > not ideal, since it would mean most row inserts land on one region
> > server. I know that some people suggest adding some randomization to
> > the keys so that the load is spread around, but I can't do that, since
> > I'll need to be able to do random-access lookups on the rows via userId.
> >
> > But I'm wondering if I could map/hash the real userId into another
> > number that would spread the inserts around. I could still do
> > random-access lookups given a real userId by recalculating the hash.
> >
> > 1) I think I like this idea; does anyone have any experience with it?
> >
> > 2) Assuming userId is an 8-byte long, what would be some good hashing
> > functions? I would be lazy and use little-endian, but I bet one of you
> > could come up with something better. :)
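For what it's worth, here is a minimal sketch of the hashed-key idea being discussed: derive the row key by hashing the userId so that sequential ids scatter across regions, while the mapping stays deterministic so a point lookup only needs the userId again. The class name and the choice of MD5 are just illustrative assumptions, not anything from the thread; any stable hash (or even byte reversal, as Fernando suggests) would work the same way.

import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical example: turn an 8-byte userId into a row key that
// spreads sequential ids across the key space. Deterministic, so a
// random-access lookup just recomputes the same key from the userId.
public class UserIdRowKey {

    public static byte[] rowKeyFor(long userId) {
        try {
            byte[] idBytes = ByteBuffer.allocate(8).putLong(userId).array();
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return md5.digest(idBytes); // 16-byte key, evenly distributed
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 should always be available", e);
        }
    }

    public static void main(String[] args) {
        // Consecutive userIds map to very different row keys.
        for (long userId = 1000L; userId < 1005L; userId++) {
            System.out.printf("userId %d -> key %s%n", userId, toHex(rowKeyFor(userId)));
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}

The trade-off, of course, is that you give up ordered scans by userId; this only buys you point lookups plus write distribution.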
