Not to take the thread off topic, but do you have any links to information
about importing directly into HFiles?

Thanks,
Mike



On Mon, Jul 27, 2009 at 3:08 PM, Ryan Rawson <[email protected]> wrote:

> Hi,
>
> You have to consider the difference between a bulk one-time import and
> a continuous row-insertion process.  Often the former needs to achieve
> extremely high insert rates (150k ops/sec or more) to import a large
> multi-hundred-million-row data set in any reasonable time frame.  The
> latter tends to be fairly slow; unless you are planning on adding
> users faster than 20,000 a second, you probably don't need to hash
> userIds.
>
> It should be possible to randomly insert data from a pre-existing data
> set.  There is some work underway to import straight into HFiles,
> skipping the regionserver, but that would only really work for one-time
> imports into new tables.
>
>
> On Mon, Jul 27, 2009 at 12:01 PM, Fernando Padilla <[email protected]>
> wrote:
> > So I will be generating lots of rows into the db keyed by userId, in
> > userId order.
> >
> > I have already learned through this mailing list that this use case is
> > not ideal, since it would mean most row inserts land on one region
> > server.  I know some people suggest adding randomization to the keys so
> > that the load is spread around, but I can't do that, since I'll need to
> > be able to do random-access lookups on the rows via userId.
> >
> >
> > But I'm wondering if I could map/hash the real userId into another
> > number that would spread out the inserts.  I could still do
> > random-access lookups given a real userId by recalculating the hash.
> >
> >
> >
> > 1) I think I like this idea; does anyone have any experience with it?
> >
> > 2) Assuming userId is an 8-byte long, what would be some good hashing
> > functions?  I would be lazy and use little-endian, but I bet one of you
> > could come up with something better. :)
> >
> >
>
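
Since importing straight into HFiles came up twice above, here is a rough
sketch of what that path can look like as a MapReduce job, assuming the
0.20-era classes (HFileOutputFormat, ImmutableBytesWritable, KeyValue).
The tab-separated input layout, the "cf"/"value" column names, and the
driver wiring are illustrative assumptions, not a confirmed recipe.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HFileBulkImport {

  // Hypothetical input layout: one "userId<TAB>value" pair per line.
  // Emits one KeyValue per line, keyed by the row bytes.
  static class ImportMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(parts[0]);
      KeyValue kv = new KeyValue(row, Bytes.toBytes("cf"),
          Bytes.toBytes("value"), Bytes.toBytes(parts[1]));
      ctx.write(new ImmutableBytesWritable(row), kv);
    }
  }

  public static void main(String[] args) throws Exception {
    // 0.20-era API: HBaseConfiguration extends Configuration.
    Job job = new Job(new HBaseConfiguration(), "hfile-bulk-import");
    job.setJarByClass(HFileBulkImport.class);
    job.setMapperClass(ImportMapper.class);
    // The default identity reducer sorts the keys during the shuffle,
    // which HFileOutputFormat requires (rows must arrive in order).
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(KeyValue.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(HFileOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The resulting directory of HFiles then has to be handed to HBase out of
band; around 0.20 that was done with the loadtable.rb script shipped with
HBase, if memory serves, and only for a brand-new table, which matches
Ryan's caveat above.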

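On the hashed-key question, here is a minimal plain-JDK sketch of the idea
Fernando describes: put a short hash prefix in front of the row key so
consecutive userIds scatter across the key space, and keep the full userId
in the key so the same key can be rebuilt for a random-access read.  The
two-byte prefix width and the choice of MD5 are arbitrary assumptions for
illustration.

import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class UserIdKeys {

  // Row key layout: 2-byte hash prefix + 8-byte big-endian userId.
  // The prefix spreads sequential userIds around; keeping the full
  // userId in the key lets us rebuild the exact same key for reads.
  public static byte[] rowKey(long userId) {
    byte[] id = ByteBuffer.allocate(8).putLong(userId).array();
    byte[] digest = md5(id);
    byte[] key = new byte[2 + 8];
    System.arraycopy(digest, 0, key, 0, 2);   // salt / bucket prefix
    System.arraycopy(id, 0, key, 2, 8);       // original userId
    return key;
  }

  private static byte[] md5(byte[] input) {
    try {
      return MessageDigest.getInstance("MD5").digest(input);
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e);  // MD5 ships with every JDK
    }
  }

  public static void main(String[] args) {
    // Adjacent userIds land in very different parts of the key space.
    for (long id = 1000L; id < 1005L; id++) {
      System.out.println(id + " -> " + toHex(rowKey(id)));
    }
  }

  private static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) sb.append(String.format("%02x", b & 0xff));
    return sb.toString();
  }
}

The obvious tradeoff is that you give up scanning users in userId order,
since rows are now sorted by the hash prefix first.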