Stack,

Having an estimated row count per region would be great. Should I open a ticket for that? I'll check out the HBase counters.
Now that you mention it, we do set LZO compression for our tables; I just forgot it's a manual thing :).

-Dmitriy

On Tue, May 4, 2010 at 7:34 AM, Stack <st...@duboce.net> wrote:

> On Tue, May 4, 2010 at 12:02 AM, Dmitriy Ryaboy <dmit...@twitter.com> wrote:
> >
> > We have an apache license file in the root of the project; I am not sure
> > if we need to put it in every file. Will check with the lawyers.
> >
>
> Generally you put notice at head of each src file. I was particularly
> referring to this that we have in all hbase files:
>
> " * Copyright 2010 The Apache Software Foundation"
>
> We got this practise from hadoop but looking there, they no longer
> seem to do it (I need to talk to lawyers too -- smile).
>
> > Regarding the first and last slice, the problem is that I have no way of
> > knowing what the first and last, respectively, key values are. With the
> > first slice I can maybe cache the first key I see, and use that in
> > conjunction with the end of the region to calculate the size of the
> > keyspace; but with that last region, the max is infinity, so I can't
> > really estimate how much more I have left until I have none.. do regions
> > store any metadata that contains a rough count of the number of records
> > they hold?
>
> Regions no. StoreFiles yes. They have number of entries but this is
> not really available via API. We should expose it or something like
> it. It could only be an estimate since delete and put are both records
> and a delete can remove cell, column or family.
>
> > I guess they only keep track of the byte size of the data, not the
> > number of records per se. Maybe I can get the total byte size of the
> > region, and calculate offsets based on the size of the returned data?
> > This would be likely wrong due to pushed down projections and filters,
> > of course. Any other ideas? How do people normally handle this when
> > writing regular MR jobs that scan HBase tables?
> >
>
> I think most tasks that go against hbase show 0% progress and then
> 100% when done.
>
> We could expose a getLastRowInRegion or what if we added an estimated
> row count to the Split (Maybe that's not the right place to expose this
> info? What is the canonical way?).
>
> > I suspect this is actually a bit of a problem, btw -- since I don't
> > report the amount of remaining work for these slices accurately, and I
> > (hopefully) do a reasonable job for the ones where I can calculate the
> > size of the keyspace, speculative execution may get overeager with
> > these two slices.
> >
>
> Good point.
>
> We should fix this. Keeping a counter of how many rows in a region
> wouldn't be hard. It could be updated on compaction, etc. A row
> count would be good enough.
>
> > PigCounterHelper just deals with some oddities of Hadoop counters (they
> > may not be available when you first try to increment a counter -- the
> > helper buffers increment requests until the reporter becomes
> > available). Are HBase counters special things or also just Hadoop
> > counters under the covers?
> >
>
> Check them out. They are not hadoop counters. Keep up a count on
> anything. Might be of use given what you are doing. Update it
> thousands of times a second, etc.
>
> > The lzo files are probably unrelated.. there shouldn't be anything
> > LZO-specific in the HBase code. We are, in fact, lzo'ing hbase content
> > in the sense that that's the compression we have for HDFS, and I think
> > HBase is supposed to inherit that.
>
> No. You need to enable it on the column family. See how COMPRESSION
> can be NONE, GZ, or LZO. LZO needs to be installed as per hadoop.
> Search the hbase wiki home page for lzo. The hbase+lzo page has had
> some experience baked in to it so may be of some use to you.
>
> St.Ack
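
For anyone following along, the keyspace-based progress estimate discussed in this thread could look roughly like the sketch below. It is not the loader's actual code: the function names, the fixed `width`, and the choice to stand in an all-0xFF key for the open-ended last region are all assumptions made for illustration. It also assumes row keys are spread roughly evenly across the keyspace, which real tables often violate.

```python
def key_to_int(key: bytes, width: int) -> int:
    """Read a row key as a big-endian integer, truncated or zero-padded to `width` bytes."""
    padded = key[:width].ljust(width, b"\x00")
    return int.from_bytes(padded, "big")


def estimate_progress(start: bytes, end: bytes, current: bytes, width: int = 8) -> float:
    """Rough fraction of the [start, end) keyspace that lies before `current`.

    An empty `end` stands for the open-ended last region. It is replaced by
    the largest possible key of `width` bytes, which is only a guess (assumption).
    An empty `start` is treated as the smallest key; a caller could instead
    pass the first key it actually saw, as suggested in the thread.
    """
    lo = key_to_int(start, width)
    hi = key_to_int(end, width) if end else (1 << (8 * width)) - 1
    cur = key_to_int(current, width)
    if hi <= lo:
        return 0.0  # degenerate range: nothing sensible to report
    frac = (cur - lo) / (hi - lo)
    return min(1.0, max(0.0, frac))  # clamp so progress stays within [0, 1]


if __name__ == "__main__":
    # Halfway between b"a" and b"c" when comparing on the first byte only.
    print(estimate_progress(b"a", b"c", b"b", width=1))
```

For the last region this will usually under-report progress (the real maximum key is far below all-0xFF), which is one reason a per-region row count estimate would be the better fix.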