Hi Ari,

1) It should work in both mapreduce mode and local mode. Make sure PIG_CLASSPATH contains the Hadoop and HBase config directories. In addition, make sure you are passing -Dpig.additional.jars=$PIG_PATH/pig-0.8-core.jar:$HBASE_HOME/hbase-0.20.6.jar when you launch Pig.
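Roughly like this (only a sketch of my setup; the config paths and "aggregation.pig" as the script name are placeholders, so adjust them to your installation):

    # sketch only -- the config paths and script name are placeholders
    export PIG_CLASSPATH=$HADOOP_HOME/conf:$HBASE_HOME/conf:$PIG_CLASSPATH
    pig -Dpig.additional.jars=$PIG_PATH/pig-0.8-core.jar:$HBASE_HOME/hbase-0.20.6.jar \
        aggregation.pig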
2) Use the hbase shell and run: scan "ClusterSummary". There should be some data in all column families, in text-readable form. It should be easy to write a small unit test program to verify the data. Sample from my cluster:

 1294116960000-chukwa  column=cpu:User, timestamp=1294117043968, value=0.041503639495471464
 1294116960000-chukwa  column=disk:ReadBytes, timestamp=1294117043968, value=28672.0
 1294116960000-chukwa  column=disk:Reads, timestamp=1294117043968, value=3.0
 1294116960000-chukwa  column=disk:WriteBytes, timestamp=1294117043968, value=2564096.0
 1294116960000-chukwa  column=disk:Writes, timestamp=1294117043968, value=213.0
 1294116960000-chukwa  column=hdfs:BlockCapacity, timestamp=1294117041309, value=32

The first column is the row key, which is composed of [timestamp]-[clustername].

3) START is the epoch from which the aggregation begins. By default, the script uses 1234567890000, which translates to Fri, 13 Feb 2009 23:31:30 GMT. If you only want to run the aggregation from the current day, run the script with START=<current epoch time in milliseconds>.

4) You could write a Java program that implements a bloom filter to scan HBase for the last row, but I think that is a very inefficient way to go about this. It's best to launch the aggregation script at a fixed interval (crontab or Oozie) to continuously process the data, and to use 2x that interval for the scanning range so that late-arriving data is covered. For example, if you run the aggregation script every 5 minutes, then use START = current time - 10 minutes.
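Something along these lines is what I have in mind; treat it as a sketch, since the wrapper name, log path, and the assumption that the script takes START as a Pig parameter are placeholders you would need to adapt:

    #!/bin/sh
    # run_aggregation.sh (hypothetical wrapper)
    # Look back 10 minutes (2x the 5-minute run interval), in epoch
    # milliseconds, so late-arriving data is picked up on the next run.
    START=$(( ($(date +%s) - 600) * 1000 ))

    # Assumes the aggregation script accepts START as a Pig parameter;
    # adjust the script name and how START is passed to match your setup.
    pig -Dpig.additional.jars=$PIG_PATH/pig-0.8-core.jar:$HBASE_HOME/hbase-0.20.6.jar \
        -param START=$START aggregation.pig

And a crontab entry to run it every 5 minutes:

    */5 * * * * /path/to/run_aggregation.sh >> /tmp/chukwa-aggregation.log 2>&1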
Regards,
Eric

On Mon, Jan 3, 2011 at 7:37 PM, Ariel Rabkin <[email protected]> wrote:
> Got a couple questions about the pig-based aggregation. These may
> slightly duplicate JIRA comments, so apologies and no need to answer
> more than once.
>
> 1) Can we run the aggregation scripts in local mode? I haven't been
> able to get Pig to read from anything other than file:/// in local
> mode. Is there a trick to it?
>
> 2) Is there a good way to sanity check my tables and make sure the
> data in HBase looks right? Not quite sure what they "should" look
> like.
>
> 3) What's the default epoch to start aggregating from? What happens
> if I don't specify START= to the command?
>
> 4) Is there a good way to find out what the last epoch it started
> summarizing from was? Is there a big cost to being over-inclusive?
>
>
> --Ari
>
> --
> Ari Rabkin [email protected]
> UC Berkeley Computer Science Department
>