Re: HBase / Pig integration

Stack Tue, 04 May 2010 07:35:06 -0700

On Tue, May 4, 2010 at 12:02 AM, Dmitriy Ryaboy <dmit...@twitter.com> wrote:
>
> We have an apache license file in the root of the project; I am not sure if
> we need to put it in every file. Will check with the lawyers.
>
Generally you put the notice at the head of each src file.  I was
particularly referring to this line that we have in all hbase files:


" * Copyright 2010 The Apache Software Foundation"

We got this practice from hadoop but looking there, they no longer
seem to do it (I need to talk to lawyers too -- smile).

> Regarding the first and last slice, the problem is that I have no way of
> knowing what the first and last, respectively, key values are. With the
> first slice I can maybe cache the first key I see, and use that in
> conjunction with the end of the region to calculate the size of the
> keyspace; but with that last region, the max is infinity, so I can't really
> estimate how much more I have left until I have none.. do regions store any
> metadata that contains a rough count of the number of records they hold?

Regions no.  StoreFiles yes.  They have the number of entries but this
is not really available via the API.  We should expose it or something
like it.  It could only be an estimate since deletes and puts are both
records and a delete can remove a cell, column or family.


> I guess they only keep track of the byte size of the data, not the number of
> records per se.   Maybe I can get the total byte size of the region, and
> calculate offsets based on the size of the returned data? This would be
> likely wrong due to pushed down projections and filters, of course. Any
> other ideas? How do people normally handle this when writing regular MR jobs
> that scan HBase tables?
>

I think most tasks that go against hbase show 0% progress and then
100% when done.

We could expose a getLastRowInRegion, or what if we added an estimated
row count to the Split?  (Maybe that's not the right place to expose
this info.  What is the canonical way?)
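
To make the key-range idea concrete, here is a rough sketch of estimating
progress by treating row keys as big-endian unsigned numbers and
interpolating between a slice's start and end key.  KeyRangeProgress and
its methods are hypothetical names, not part of HBase or Pig.  It assumes
keys are spread roughly evenly over the range, and for the last region
(empty end key) it falls back on the largest possible key of the same
width as the upper bound, which is a crude stand-in for "infinity".

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper (not an HBase or Pig class). */
final class KeyRangeProgress {

    /** Reads a key as an unsigned big-endian number, zero-padded on the right to width bytes. */
    static BigInteger toNumber(byte[] key, int width) {
        return new BigInteger(1, Arrays.copyOf(key, width));
    }

    /**
     * Estimates the fraction of [start, end) already scanned, given the current row key.
     * An empty start key is read as zero; an empty end key (last region) is
     * replaced by the largest key of the same width. Both are assumptions.
     */
    static float estimate(byte[] start, byte[] end, byte[] current) {
        int width = Math.max(start.length, Math.max(end.length, current.length));
        if (width == 0) {
            return 0f;
        }
        BigInteger lo = toNumber(start, width);
        BigInteger hi = end.length == 0
                ? BigInteger.ONE.shiftLeft(8 * width).subtract(BigInteger.ONE)
                : toNumber(end, width);
        BigInteger range = hi.subtract(lo);
        if (range.signum() <= 0) {
            return 0f;
        }
        // Clamp so keys outside the range never give a value below 0 or above 1.
        BigInteger done = toNumber(current, width).subtract(lo)
                .max(BigInteger.ZERO).min(range);
        return done.multiply(BigInteger.valueOf(10_000)).divide(range).floatValue() / 10_000f;
    }

    public static void main(String[] args) {
        byte[] start = "a".getBytes(StandardCharsets.UTF_8);
        byte[] end = "c".getBytes(StandardCharsets.UTF_8);
        byte[] current = "b".getBytes(StandardCharsets.UTF_8);
        System.out.println(estimate(start, end, current));
    }
}
```

Pushed-down filters and skewed key distributions will still throw this
off, so it is only useful as a hint for progress reporting.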

> I suspect this is actually a bit of a problem, btw -- since I don't report
> the amount of remaining work for these slices accurately, and I (hopefully)
> do a reasonable job for the ones where I can calculate the size of the
> keyspace, speculative execution may get overeager with these two slices.
>
Good point.

We should fix this.  Keeping a counter of how many rows are in a region
wouldn't be hard.  It could be updated on compaction, etc.  A row
count would be good enough.
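
If a split did carry an estimated row count, the reader side could be as
simple as the sketch below.  RowCountProgress is a hypothetical class and
the estimate is assumed to be handed in from somewhere (no such API exists
as of this thread).  Since the count can only be approximate, the result
is clamped so progress never goes past 100%.

```java
/** Hypothetical progress tracker driven by an estimated row count. */
final class RowCountProgress {
    private final long estimatedRows;
    private long rowsRead;

    RowCountProgress(long estimatedRows) {
        this.estimatedRows = estimatedRows;
    }

    /** Call once per row returned by the scanner. */
    void rowRead() {
        rowsRead++;
    }

    /** Returns a value in [0, 1]; 0 when no estimate is available. */
    float getProgress() {
        if (estimatedRows <= 0) {
            return 0f;
        }
        return Math.min(1f, (float) rowsRead / estimatedRows);
    }

    public static void main(String[] args) {
        RowCountProgress progress = new RowCountProgress(4);
        for (int i = 0; i < 3; i++) {
            progress.rowRead();
        }
        System.out.println(progress.getProgress());
    }
}
```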


> PigCounterHelper just deals with some oddities of Hadoop counters (they may
> not be available when you first try to increment a counter -- the helper
> buffers increment requests until the reporter becomes available). Are HBase
> counters special things or also just Hadoop counters under the covers?
>
Check them out.  They are not hadoop counters.  Keep up a count on
anything.  Might be of use given what you are doing.  Update it
thousands of times a second, etc.
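
For readers following along, the buffering behaviour described above for
PigCounterHelper can be sketched roughly as follows.  This is not Pig's
actual code: BufferedCounters, setSink and the ObjLongConsumer standing in
for the Hadoop reporter are all made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ObjLongConsumer;

/** Hypothetical sketch: buffer counter increments until a reporter shows up. */
final class BufferedCounters {
    private final Map<String, Long> pending = new HashMap<>();
    private ObjLongConsumer<String> sink; // stays null until the reporter is available

    /** Attach the real counter sink and replay everything buffered so far. */
    void setSink(ObjLongConsumer<String> newSink) {
        this.sink = newSink;
        pending.forEach((name, delta) -> sink.accept(name, delta));
        pending.clear();
    }

    /** Increment a named counter, buffering if no sink is attached yet. */
    void increment(String name, long delta) {
        if (sink == null) {
            pending.merge(name, delta, Long::sum);
        } else {
            sink.accept(name, delta);
        }
    }

    public static void main(String[] args) {
        BufferedCounters counters = new BufferedCounters();
        counters.increment("rows", 2); // buffered, no sink yet
        counters.setSink((name, delta) -> System.out.println(name + " += " + delta));
        counters.increment("rows", 1); // forwarded directly
    }
}
```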


> The lzo files are probably unrelated.. there shouldn't be anything
> LZO-specific in the HBase code. We are, in fact, lzo'ing hbase content in
> the sense that that's the compression we have for HDFS, and I think HBase is
> supposed to inherit that.

No.  You need to enable it on the column family.  See how COMPRESSION
can be NONE, GZ, or LZO.  LZO needs to be installed as per hadoop.
Search the hbase wiki home page for lzo.  The hbase+lzo page has had
some experience baked into it so may be of some use to you.

St.Ack
