Re: HBase / Pig integration

Dmitriy Ryaboy Tue, 04 May 2010 09:31:35 -0700

Stack,

Having an estimated row count per region would be great. Should I open a
ticket for that?
I'll check out the HBase counters.
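
For reference, a minimal sketch of the keyspace-based progress estimate
discussed in the quoted text below. It assumes row keys compare as unsigned
big-endian bytes and that an empty end key marks the open-ended last region
(the HBase convention). The class and method names are hypothetical and not
part of HBase or Pig. For the last region it simply reports 0 until the scan
finishes, since the upper bound is unknown.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

// Hypothetical helper: estimates scan progress within one region/slice.
final class KeyspaceProgress {

    // Read a key as an unsigned big-endian integer, right-padded with
    // zero bytes to `width` so keys of different lengths are comparable.
    static BigInteger toNumber(byte[] key, int width) {
        byte[] padded = new byte[width];
        System.arraycopy(key, 0, padded, 0, Math.min(key.length, width));
        return new BigInteger(1, padded);
    }

    // start:   region start key, or the first key seen when the region
    //          start is open-ended (first slice)
    // end:     region end key; empty means open-ended (last slice)
    // current: the most recently read row key
    // Returns a fraction in [0, 1]; 0 when the range cannot be estimated.
    static double estimate(byte[] start, byte[] end, byte[] current) {
        if (end.length == 0) {
            return 0.0; // last region: the max is unknown, so report nothing
        }
        int width = Math.max(Math.max(start.length, end.length), current.length);
        BigInteger lo = toNumber(start, width);
        BigInteger hi = toNumber(end, width);
        BigInteger cur = toNumber(current, width);
        BigInteger range = hi.subtract(lo);
        if (range.signum() <= 0) {
            return 0.0; // degenerate or inverted range
        }
        double fraction = new BigDecimal(cur.subtract(lo))
                .divide(new BigDecimal(range), 6, RoundingMode.HALF_UP)
                .doubleValue();
        // Clamp, since the current key may fall slightly outside the range.
        return Math.max(0.0, Math.min(1.0, fraction));
    }
}
```

This only measures position in the keyspace, not the number of rows, so it
will be off for skewed key distributions; an estimated row count per region
would still be the better signal.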


Now that you mention it, we do set LZO compression for our tables, I just
forgot it's a manual thing :).

-Dmitriy

On Tue, May 4, 2010 at 7:34 AM, Stack <st...@duboce.net> wrote:

> On Tue, May 4, 2010 at 12:02 AM, Dmitriy Ryaboy <dmit...@twitter.com>
> wrote:
> >
> > We have an apache license file in the root of the project; I am not sure
> > if we need to put it in every file. Will check with the lawyers.
> >
> Generally you put notice at head of each src file.  I was particularly
> referring to this that we have in all hbase files:
>
> " * Copyright 2010 The Apache Software Foundation"
>
> We got this practise from hadoop but looking there, they no longer
> seem to do it (I need to talk to lawyers too -- smile).
>
> > Regarding the first and last slice, the problem is that I have no way of
> > knowing what the first and last, respectively, key values are. With the
> > first slice I can maybe cache the first key I see, and use that in
> > conjunction with the end of the region to calculate the size of the
> > keyspace; but with that last region, the max is infinity, so I can't
> > really estimate how much more I have left until I have none. Do regions
> > store any metadata that contains a rough count of the number of records
> > they hold?
>
> Regions no.  StoreFiles yes.  They have number of entries but this is
> not really available via API.  We should expose it or something like
> it.  It could only be an estimate, since deletes and puts are both
> stored as records and a delete can remove a cell, column or family.
>
>
> > I guess they only keep track of the byte size of the data, not the number
> > of records per se. Maybe I can get the total byte size of the region, and
> > calculate offsets based on the size of the returned data? This would be
> > likely wrong due to pushed down projections and filters, of course. Any
> > other ideas? How do people normally handle this when writing regular MR
> > jobs that scan HBase tables?
> >
>
> I think most tasks that go against hbase should show 0% progress and
> then 100% when done.
>
> We could expose a getLastRowInRegion.  Or what if we added an estimated
> row count to the Split?  (Maybe that's not the right place to expose
> this info?  What is the canonical way?)
>
> > I suspect this is actually a bit of a problem, btw -- since I don't
> > report the amount of remaining work for these slices accurately, and I
> > (hopefully) do a reasonable job for the ones where I can calculate the
> > size of the keyspace, speculative execution may get overeager with these
> > two slices.
> >
> Good point.
>
> We should fix this.  Keeping a counter of how many rows in a region
> wouldn't be hard.  It could be updated on compaction, etc.  A row
> count would be good enough.
>
>
> > PigCounterHelper just deals with some oddities of Hadoop counters (they
> > may not be available when you first try to increment a counter -- the
> > helper buffers increment requests until the reporter becomes available).
> > Are HBase counters special things or also just Hadoop counters under the
> > covers?
> >
> Check them out.  They are not  hadoop counters.  Keep up a count on
> anything.  Might be of use given what you are doing.  Update it
> thousands of times a second, etc.
>
>
> > The lzo files are probably unrelated.. there shouldn't be anything
> > LZO-specific in the HBase code. We are, in fact, lzo'ing hbase content in
> > the sense that that's the compression we have for HDFS, and I think HBase
> > is supposed to inherit that.
>
> No.  You need to enable it on the column family.  See how COMPRESSION
> can be NONE, GZ, or LZO.  LZO needs to be installed as per hadoop.
> Search the hbase wiki home page for lzo.  The hbase+lzo page has had
> some experience baked in to it so may be of some use to you.
>
> St.Ack
>
