I do not know of schemas with more than 5 or 6 families. My guess is that there will be issues. One issue for sure is that we do not parallelize queries across families yet; the queries run in series, so they will be slow when lots of families are involved. It shouldn't be hard to address, it just has not been a priority. You might be OK if only 5 families are queried at a time as part of a MR job.

Going by the content of your first mail, my guess is that you are clear on the difference between a column family and a column. If you are not, I would suggest you look into it. Maybe you do not need that many families.
On Dec 1, 2009, at 4:03 PM, Calvin <[email protected]> wrote:
Thanks again for the responses.
Stack: What's the issue with 25 families? I will mostly be accessing HBase as a map-reduce source and will be looking at ~5 column families at a time. Is there any documentation on column family limits in practice?
-Calvin
On Mon, Nov 30, 2009 at 7:38 PM, stack <[email protected]> wrote:
There is little art to the HFileOutputFormat. You could play with it to make it support multiple column families (as soon as any family hits the region size boundary, start up a new set of files). If keys are already sorted, you could hook HFileOutputFormat up to the map (I think) and avoid the mapreduce framework sort.
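Roughly, the map-only hookup could look like the below. This is just a sketch of the idea; the mapper, input format, output path, and the "f"/"q" family and qualifier names are made up, it assumes the 0.20-era org.apache.hadoop.hbase.mapreduce API, and it only works if the input really is sorted by row key and writes a single family (per HBASE-1861):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HFileLoadJob {

  // Hypothetical mapper: expects "rowkey<TAB>value" lines already sorted by rowkey.
  public static class LineToKeyValueMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(parts[0]);
      KeyValue kv = new KeyValue(row, Bytes.toBytes("f"), Bytes.toBytes("q"),
          Bytes.toBytes(Double.parseDouble(parts[1])));
      context.write(new ImmutableBytesWritable(row), kv);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration();
    Job job = new Job(conf, "write-hfiles");
    job.setJarByClass(HFileLoadJob.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(LineToKeyValueMapper.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(KeyValue.class);
    job.setNumReduceTasks(0);                    // map-only: skips the framework sort,
                                                 // only valid because input is pre-sorted
    job.setOutputFormatClass(HFileOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}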
25 families is probably too many for current hbase -- depends on how you'll be accessing them.
Can you put your input files under an http server and then write a mapreduce that pulls via HTTP?
St.Ack
On Mon, Nov 30, 2009 at 3:33 PM, Calvin <[email protected]> wrote:
Thanks for the responses. If I can avoid writing a map-reduce job, that would be preferable (getting map-reduce to work with / depend on my existing infrastructure is turning out to be annoying).

I have no good way of randomizing my dataset since it's a very large stream of sequential data (ordered by some key). I have a fair number of column families (~25) and every column is a long or a double. A standalone program that writes rows using the HTable / Put API seems to run at ~2,000-5,000 rows/sec, which seems ridiculously slow. Is it possible I am doing something terribly wrong?
-Calvin
On Mon, Nov 30, 2009 at 5:47 PM, Ryan Rawson <[email protected]> wrote:
Sequentially ordered rows are the worst insert case in HBase - you end up writing everything to 1 server even if you have 500. If you can randomize your input (I have pasted a Randomize.java map reduce that will randomize the lines of a file; a sketch of the idea is below), your performance will improve.
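Something like this is the gist (just a rough sketch, not the exact Randomize.java I pasted; class names are placeholders): map each line under a random key so the shuffle scatters the sorted input, then have the reducer spit the lines back out in their new order.

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RandomizeLines {

  public static class RandomKeyMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    private final Random rand = new Random();
    private final LongWritable outKey = new LongWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      outKey.set(rand.nextLong());   // random sort key: lines come out in random order
      context.write(outKey, line);
    }
  }

  public static class LineReducer
      extends Reducer<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void reduce(LongWritable key, Iterable<Text> lines, Context context)
        throws IOException, InterruptedException {
      for (Text line : lines) {
        context.write(NullWritable.get(), line);   // drop the random key, keep the line
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "randomize-lines");
    job.setJarByClass(RandomizeLines.class);
    job.setMapperClass(RandomKeyMapper.class);
    job.setReducerClass(LineReducer.class);
    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}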
I have seen sustained inserts of 100-300k rows/sec on small rows before. Obviously large blob rows will be slower, since the limiting factor is how fast we can write data to HDFS, so it isn't the actual row count that matters but the amount of data involved.
Try the randomize.java, see where that gets you. I think it's on the list archives.
-ryan
On Mon, Nov 30, 2009 at 2:41 PM, Jean-Daniel Cryans <[email protected]> wrote:
Could you put your data in HDFS and load it from there with a MapReduce job?
J-D
On Mon, Nov 30, 2009 at 2:33 PM, Calvin <[email protected]> wrote:
I have a large amount of sequentially ordered rows I would like to write to an HBase table. What is the preferred way to do bulk writes of multi-column tables in HBase? Using the get/put interface seems fairly slow even if I batch the writes with table.put(List<Put>).
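For reference, here is roughly what my loader does (a minimal sketch; the table name, family, qualifier, and row count are placeholders, and I'm assuming the 0.20-era HTable/Put client API with auto-flush turned off so Puts are buffered into fewer RPCs):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BulkPutLoader {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "mytable");
    table.setAutoFlush(false);                    // buffer Puts client-side
    table.setWriteBufferSize(12 * 1024 * 1024);   // ~12 MB before an RPC is sent

    List<Put> batch = new ArrayList<Put>();
    for (long i = 0; i < 1000000; i++) {          // stand-in for my real input stream
      Put p = new Put(Bytes.toBytes(i));
      p.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes((double) i));
      batch.add(p);
      if (batch.size() >= 1000) {                 // hand a chunk to the client buffer
        table.put(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      table.put(batch);
    }
    table.flushCommits();                         // push anything still buffered
  }
}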
I have followed the directions on:
* http://wiki.apache.org/hadoop/PerformanceTuning
* http://ryantwopointoh.blogspot.com/2009/01/performance-of-hbase-importing.html
Are there any other resources for improving the throughput of my bulk writes? On
http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/mapreduce/package-summary.html
I see there's a way to write HFiles directly, but HFileOutputFormat can only write a single column family at a time (https://issues.apache.org/jira/browse/HBASE-1861).
Thanks!
-Calvin