There is little art to HFileOutputFormat.  You could play with it to
make it support multiple column families (as soon as any family hits the
region size boundary, start up a new set of files).  If your keys are already
sorted, you could hook HFileOutputFormat up to the map directly (I think) and
avoid the mapreduce framework sort.
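
For example, a rough map-only sketch of that "hook it up to the map" idea
(untested; the "rowkey<TAB>value" input layout, column family "f", and
qualifier "v" are made-up assumptions, and each task's input split must
already be disjoint and in key order so its HFile comes out sorted).  It
writes a single family only, per HBASE-1861, and the resulting HFiles still
have to be handed to the bulk-load script afterwards:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortedLineMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

  private static final byte[] FAMILY = Bytes.toBytes("f");
  private static final byte[] QUALIFIER = Bytes.toBytes("v");

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Assumed record layout: "rowkey<TAB>value", already sorted by rowkey.
    String[] parts = line.toString().split("\t", 2);
    byte[] row = Bytes.toBytes(parts[0]);
    context.write(new ImmutableBytesWritable(row),
        new KeyValue(row, FAMILY, QUALIFIER, Bytes.toBytes(parts[1])));
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "sorted-bulk-hfiles");
    job.setJarByClass(SortedLineMapper.class);
    job.setMapperClass(SortedLineMapper.class);
    job.setNumReduceTasks(0);                    // skip the framework sort
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(KeyValue.class);
    job.setOutputFormatClass(HFileOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}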

25 families is probably too many for current hbase -- depends on how you'll
be accessing them.

Can you put your input files under an HTTP server and then write a mapreduce
job that pulls them via HTTP?
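
Something like the following, only the mapper shown (a sketch; the table name
"mytable", the column names, and the "one URL per input line" layout are all
assumptions): the job input is a plain text file listing URLs, and each map
task streams a URL and writes what it finds through the HTable client,
batching via the write buffer.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HttpPullMapper extends Mapper<LongWritable, Text, Text, Text> {

  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    table = new HTable(new HBaseConfiguration(), "mytable"); // name assumed
    table.setAutoFlush(false);                  // batch puts client-side
    table.setWriteBufferSize(12 * 1024 * 1024);
  }

  @Override
  protected void map(LongWritable offset, Text urlLine, Context context)
      throws IOException, InterruptedException {
    URL url = new URL(urlLine.toString().trim());
    BufferedReader in =
        new BufferedReader(new InputStreamReader(url.openStream()));
    String line;
    while ((line = in.readLine()) != null) {
      // Assumed record layout: "rowkey<TAB>value".
      String[] parts = line.split("\t", 2);
      Put put = new Put(Bytes.toBytes(parts[0]));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("v"), Bytes.toBytes(parts[1]));
      table.put(put);      // buffered until the write buffer fills
    }
    in.close();
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    table.flushCommits();  // push anything still sitting in the buffer
    table.close();
  }
}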

St.Ack

On Mon, Nov 30, 2009 at 3:33 PM, Calvin <[email protected]> wrote:

> Thanks for the responses.  If I can avoid writing a map-reduce job, that
> would be preferable (getting map-reduce to work with / depend on my existing
> infrastructure is turning out to be annoying).
>
> I have no good way of randomizing my dataset since it's a very large stream
> of sequential data (ordered by some key).  I have a fair number of column
> families (~25), and every column is a long or a double.  A standalone
> program that writes rows using the HTable / Put API seems to run at
> ~2,000-5,000 rows/sec, which seems ridiculously slow.  Is it possible I am
> doing something terribly wrong?
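
The usual first things to check on the client side are auto-flush and the
write buffer -- a minimal sketch (the table, family, and qualifier names here
are made up, and the commented-out WAL line is optional and trades durability
for speed):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedLoader {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "mytable");
    table.setAutoFlush(false);                   // no RPC per put()
    table.setWriteBufferSize(12 * 1024 * 1024);  // flush in ~12MB batches

    for (long i = 0; i < 1000000; i++) {
      Put put = new Put(Bytes.toBytes(i));
      // put.setWriteToWAL(false);  // faster, but rows are lost if a region
      //                            // server dies before a memstore flush
      put.add(Bytes.toBytes("f"), Bytes.toBytes("v"), Bytes.toBytes(i));
      table.put(put);              // buffered client-side
    }
    table.flushCommits();          // push whatever is left in the buffer
    table.close();
  }
}
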
>
> -Calvin
>
> On Mon, Nov 30, 2009 at 5:47 PM, Ryan Rawson <[email protected]> wrote:
>
> > Sequentially ordered rows are the worst insert case in HBase - you end
> > up writing everything to 1 server even if you have 500.  If you can
> > randomize your input (I have pasted a Randomize.java map reduce that
> > will randomize the lines of a file), your performance will improve.
> >
> > I have seen sustained inserts of 100-300k rows/sec on small rows
> > before.  Obviously large blob rows will be slower, since the limiting
> > factor is how fast we can write data to HDFS; it isn't the actual
> > row count that matters but the amount of data involved.
> >
> > Try the Randomize.java and see where that gets you.  I think it's in the
> > list archives.
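
(Not Ryan's actual Randomize.java, which isn't in this thread -- but the idea
is presumably along these lines: key each line with a random number so the
shuffle scatters the originally-sorted input, then drop the keys again.)

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RandomizeLines {

  public static class RandomKeyMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    private final Random rnd = new Random();
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      context.write(new LongWritable(rnd.nextLong()), line); // random sort key
    }
  }

  public static class DropKeyReducer
      extends Reducer<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void reduce(LongWritable key, Iterable<Text> lines,
        Context context) throws IOException, InterruptedException {
      for (Text line : lines) {
        context.write(NullWritable.get(), line);  // emit line, discard the key
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "randomize-lines");
    job.setJarByClass(RandomizeLines.class);
    job.setMapperClass(RandomKeyMapper.class);
    job.setReducerClass(DropKeyReducer.class);
    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
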
> >
> > -ryan
> >
> >
> > On Mon, Nov 30, 2009 at 2:41 PM, Jean-Daniel Cryans <[email protected]> wrote:
> > > Could you put your data in HDFS and load it from there with a MapReduce job?
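
A rough sketch of that route, using the TableMapReduceUtil /
IdentityTableReducer plumbing from the mapreduce package (the
"rowkey<TAB>value" file layout, the table name "mytable", and the column
names are assumptions, not anything from this thread):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class HdfsToHBaseLoader
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Assumed record layout: "rowkey<TAB>value".
    String[] parts = line.toString().split("\t", 2);
    byte[] row = Bytes.toBytes(parts[0]);
    Put put = new Put(row);
    put.add(Bytes.toBytes("f"), Bytes.toBytes("v"), Bytes.toBytes(parts[1]));
    context.write(new ImmutableBytesWritable(row), put);
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new HBaseConfiguration(), "hdfs-to-hbase");
    job.setJarByClass(HdfsToHBaseLoader.class);
    job.setMapperClass(HdfsToHBaseLoader.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // IdentityTableReducer just forwards the Puts to TableOutputFormat.
    TableMapReduceUtil.initTableReducerJob("mytable",
        IdentityTableReducer.class, job);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
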
> > >
> > > J-D
> > >
> > > On Mon, Nov 30, 2009 at 2:33 PM, Calvin <[email protected]> wrote:
> > >> I have a large amount of sequentially ordered rows I would like to
> > >> write to an HBase table.  What is the preferred way to do bulk writes
> > >> of multi-column tables in HBase?  Using the get/put interface seems
> > >> fairly slow even if I batch writes with table.put(List<Put>).
> > >>
> > >> I have followed the directions on:
> > >>   * http://wiki.apache.org/hadoop/PerformanceTuning
> > >>   * http://ryantwopointoh.blogspot.com/2009/01/performance-of-hbase-importing.html
> > >>
> > >> Are there any other resources for improving the throughput of my bulk
> > >> writes?  On
> > >> http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/mapreduce/package-summary.html
> > >> I see there's a way to write HFiles directly, but HFileOutputFormat can
> > >> only write a single column family at a time
> > >> (https://issues.apache.org/jira/browse/HBASE-1861).
> > >>
> > >> Thanks!
> > >>
> > >> -Calvin
> > >>
> > >
> >
>
