Thanks again for the responses.

Stack: What's the issue with 25 families?  I will mostly be accessing HBase
as a map-reduce source and will be looking at ~5 column families at a time.
 Is there any documentation on column family limits in practice?

-Calvin

On Mon, Nov 30, 2009 at 7:38 PM, stack <[email protected]> wrote:

> There is little art to the HFileOutputFormat.  You could play with it to
> make it support multiple column families (As soon as any family hits the
> region size boundary, start up a new set of files).  If keys are already
> sorted, could hook up HFileOutputFormat to the map (I think) and avoid the
> mapreduce framework sort.
>
> 25 families is probably too many for current hbase -- depends on how you'll
> be accessing them.
>
> Can you put your input files under an http server and then write a
> mapreduce
> that pulls via HTTP?
>
> St.Ack
>
> On Mon, Nov 30, 2009 at 3:33 PM, Calvin <[email protected]> wrote:
>
> > Thanks for the responses.  If I can avoid writing a map-reduce job that
> > would be preferable (getting map-reduce to work with / depend on my
> > existing
> > infrastructure is turning out to be annoying).
> >
> > I have no good way of randomizing my dataset since it's a very large
> stream
> > of sequential data (ordered by some key).  I have a fair number of column
> > families (~25) and every column is a long or a double.  Having a
> standalone
> > program that writes rows using the HTable / Put API seems to run at
> ~2-5000
> > rows/sec, which seems ridiculously slow.  Is it possible I am doing
> > something terribly wrong?
> >
> > -Calvin
> >
> > On Mon, Nov 30, 2009 at 5:47 PM, Ryan Rawson <[email protected]> wrote:
> >
> > > Sequentially ordered rows is the worst insert case in HBase - you end
> > > up writing all to 1 server even if you have 500.  If you could
> > > randomize your input, and I have pasted a Randomize.java map reduce
> > > that will randomize lines of a file, then your performance will
> > > improve.
> > >
> > > I have seen sustained inserts of 100-300k rows/sec on small rows
> > > before.  Obviously large blob rows will be slower, since the limiting
> > > factor is how fast we can write data to HDFS, thus it isnt the actual
> > > row count, but the amount of data involved.
> > >
> > > Try the randomize.java, see where that gets you. I think it's on the
> > > list archives.
> > >
> > > -ryan
> > >
> > >
> > > On Mon, Nov 30, 2009 at 2:41 PM, Jean-Daniel Cryans <
> [email protected]
> > >
> > > wrote:
> > > > Could you put your data in HDFS and load it from there with a
> MapReduce
> > > job?
> > > >
> > > > J-D
> > > >
> > > > On Mon, Nov 30, 2009 at 2:33 PM, Calvin <[email protected]>
> > wrote:
> > > >> I have a large amount of sequential ordered rows I would like to
> write
> > > to an
> > > >> HBase table.  What is the preferred way to do bulk writes of
> > > multi-column
> > > >> tables in HBase?  Using the get/put interface seems fairly slow even
> > if
> > > I
> > > >> bulk writes with table.put(List<Put>).
> > > >>
> > > >> I have followed the directions on:
> > > >>   * http://wiki.apache.org/hadoop/PerformanceTuning
> > > >>   *
> > > >>
> > >
> >
> http://ryantwopointoh.blogspot.com/2009/01/performance-of-hbase-importing.html
> > > >>
> > > >> Are there any other resources for improving the throughput of my
> bulk
> > > >> writes?  On
> > > >>
> > >
> >
> http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/mapreduce/package-summary.htmlI
> > > >> see there's a way to write HFiles directly, but HFileOutputFormat
> can
> > > >> only
> > > >> write a single column famly at a time (
> > > >> https://issues.apache.org/jira/browse/HBASE-1861).
> > > >>
> > > >> Thanks!
> > > >>
> > > >> -Calvin
> > > >>
> > > >
> > >
> >
>

Reply via email to