Hi Ari,

1) It should work in both mapreduce mode and local mode. Make sure PIG_CLASSPATH contains the Hadoop and HBase config directories. In addition, make sure you are passing -Dpig.additional.jars=$PIG_PATH/pig-0.8-core.jar:$HBASE_HOME/hbase-0.20.6.jar when you launch Pig.
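Roughly like this (only a sketch of my setup; the config paths and "aggregation.pig" as the script name are placeholders, so adjust them to your installation):

    # sketch only -- the config paths and script name are placeholders
    export PIG_CLASSPATH=$HADOOP_HOME/conf:$HBASE_HOME/conf:$PIG_CLASSPATH
    pig -Dpig.additional.jars=$PIG_PATH/pig-0.8-core.jar:$HBASE_HOME/hbase-0.20.6.jar \
        aggregation.pig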
2) Use the hbase shell and run: scan "ClusterSummary". There should be some data in all column families, in text-readable form. It should be easy to write a small unit test program to verify the data. Sample from my cluster:

 1294116960000-chukwa  column=cpu:User, timestamp=1294117043968, value=0.041503639495471464
 1294116960000-chukwa  column=disk:ReadBytes, timestamp=1294117043968, value=28672.0
 1294116960000-chukwa  column=disk:Reads, timestamp=1294117043968, value=3.0
 1294116960000-chukwa  column=disk:WriteBytes, timestamp=1294117043968, value=2564096.0
 1294116960000-chukwa  column=disk:Writes, timestamp=1294117043968, value=213.0
 1294116960000-chukwa  column=hdfs:BlockCapacity, timestamp=1294117041309, value=32

The first column is the row key, which is composed of [timestamp]-[clustername].

3) START is the epoch from which the aggregation begins. By default, the script uses 1234567890000, which translates to Fri, 13 Feb 2009 23:31:30 GMT. If you only want to run the aggregation from the current day, run the script with START=<current epoch time in milliseconds>.

4) You could write a Java program that implements a bloom filter to scan HBase for the last row, but I think that is a very inefficient way to go about this. It's best to launch the aggregation script at a fixed interval (crontab or Oozie) to continuously process the data, and to use 2x that interval for the scanning range so that late-arriving data is covered. For example, if you run the aggregation script every 5 minutes, then use START = current time - 10 minutes.
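Something along these lines is what I have in mind; treat it as a sketch, since the wrapper name, log path, and the assumption that the script takes START as a Pig parameter are placeholders you would need to adapt:

    #!/bin/sh
    # run_aggregation.sh (hypothetical wrapper)
    # Look back 10 minutes (2x the 5-minute run interval), in epoch
    # milliseconds, so late-arriving data is picked up on the next run.
    START=$(( ($(date +%s) - 600) * 1000 ))

    # Assumes the aggregation script accepts START as a Pig parameter;
    # adjust the script name and how START is passed to match your setup.
    pig -Dpig.additional.jars=$PIG_PATH/pig-0.8-core.jar:$HBASE_HOME/hbase-0.20.6.jar \
        -param START=$START aggregation.pig

And a crontab entry to run it every 5 minutes:

    */5 * * * * /path/to/run_aggregation.sh >> /tmp/chukwa-aggregation.log 2>&1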
Regards,
Eric

On Mon, Jan 3, 2011 at 7:37 PM, Ariel Rabkin <[email protected]> wrote:
> Got a couple questions about the pig-based aggregation. These may
> slightly duplicate JIRA comments, so apologies and no need to answer
> more than once.
>
> 1) Can we run the aggregation scripts in local mode? I haven't been
> able to get Pig to read from anything other than file:/// in local
> mode. Is there a trick to it?
>
> 2) Is there a good way to sanity check my tables and make sure the
> data in HBase looks right? Not quite sure what they "should" look
> like.
>
> 3) What's the default epoch to start aggregating from? What happens
> if I don't specify START= to the command?
>
> 4) Is there a good way to find out what the last epoch it started
> summarizing from was? Is there a big cost to being over-inclusive?
>
>
> --Ari
>
> --
> Ari Rabkin [email protected]
> UC Berkeley Computer Science Department
>