On Tue, May 4, 2010 at 12:02 AM, Dmitriy Ryaboy <dmit...@twitter.com> wrote:
>
> We have an apache license file in the root of the project; I am not sure if
> we need to put it in every file. Will check with the lawyers.
>

Generally you put a notice at the head of each source file. I was particularly referring to this line, which we have in all hbase files:

" * Copyright 2010 The Apache Software Foundation"

We got this practice from hadoop but, looking there, they no longer seem to do it (I need to talk to lawyers too -- smile).

> Regarding the first and last slice, the problem is that I have no way of
> knowing what the first and last, respectively, key values are. With the
> first slice I can maybe cache the first key I see, and use that in
> conjunction with the end of the region to calculate the size of the
> keyspace; but with that last region, the max is infinity, so I can't really
> estimate how much more I have left until I have none.. do regions store any
> metadata that contains a rough count of the number of records they hold?

Regions no. StoreFiles yes. They have a number of entries, but this is not really available via the API. We should expose it or something like it. It could only be an estimate, since deletes and puts are both counted as entries, and a delete can remove a cell, a column, or a whole family.

> I guess they only keep track of the byte size of the data, not the number of
> records per se. Maybe I can get the total byte size of the region, and
> calculate offsets based on the size of the returned data? This would be
> likely wrong due to pushed down projections and filters, of course. Any
> other ideas? How do people normally handle this when writing regular MR jobs
> that scan HBase tables?
>

I think most tasks that go against hbase just report 0% progress and then 100% when done. We could expose a getLastRowInRegion, or what if we added an estimated row count to the Split? (Maybe that's not the right place to expose this info. What is the canonical way?)

> I suspect this is actually a bit of a problem, btw -- since I don't report
> the amount of remaining work for these slices accurately, and I (hopefully)
> do a reasonable job for the ones where I can calculate the size of the
> keyspace, speculative execution may get overeager with these two slices.
>

Good point. We should fix this.
Keeping a counter of how many rows are in a region wouldn't be hard. It could be updated on compaction, etc. A row count would be good enough.

> PigCounterHelper just deals with some oddities of Hadoop counters (they may
> not be available when you first try to increment a counter -- the helper
> buffers increment requests until the reporter becomes available). Are HBase
> counters special things or also just Hadoop counters under the covers?
>

Check them out. They are not hadoop counters. You can keep a count of anything and update it thousands of times a second, etc. Might be of use given what you are doing.

> The lzo files are probably unrelated.. there shouldn't be anything
> LZO-specific in the HBase code. We are, in fact, lzo'ing hbase content in
> the sense that that's the compression we have for HDFS, and I think HBase is
> supposed to inherit that.

No. You need to enable it on the column family. See how COMPRESSION can be NONE, GZ, or LZO. LZO needs to be installed as per hadoop. Search the hbase wiki home page for lzo. The hbase+lzo page has had some experience baked into it, so it may be of some use to you.

St.Ack