Raghu Rajagopalan
Tue, 01 Jul 2008 12:13:36 -0700
Hi,

I wrote a small Pig script with a couple of functions, and it works fine in local mode. However, I run into a problem when I run it on a Hadoop cluster against a 4 GB file (an Apache access log). The job is submitted successfully, and the input is split into 66 map tasks (64 MB chunk size). On my cluster of 10 machines, the first 10 maps start, but they never seem to terminate (progress climbs to 1200% on the Hadoop map/reduce tasks). I don't see anything untoward in the logs either. On the command line, Pig's progress indicator keeps printing indefinitely.

The Pig script and the functions it refers to are attached. I'm wondering whether anyone has seen anything similar, and what steps might be needed to fix it.

CsvLogStorage.java - load function that uses opencsv to parse the Apache log
REGEX.java - regex splitter that outputs a tuple for a given regex
SPLITDATE.java - parses a date and outputs a tuple with the given date parts

My guess is that there's something wrong with the way the custom load function is written.

My setup:
Hadoop 0.17
pig.jar from the pigtutorial.tar.gz on the wiki

Thanks for looking.

Raghu
pigfunctions-src.tgz
Description: GNU Zip compressed data
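
One thing that may be worth checking, given the guess above about the custom load function: a loader that is handed one chunk of a larger file is normally expected to stop once it has read past the end of its chunk. If it ignores that boundary and keeps reading to the end of the file, each map task would process far more than its 64 MB share, which is one possible way for reported progress to climb well past 100%. The sketch below is only an illustration of that boundary rule on an in-memory byte array. The class name SplitBoundedLineReader and the method readSplit are hypothetical, this is not Pig's LoadFunc API, and it is not taken from the attached CsvLogStorage.java.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Pig's LoadFunc interface: reads only the
// newline-terminated records that belong to the split [offset, end).
public class SplitBoundedLineReader {

    // Assumed convention: a record belongs to the split in which its first
    // byte falls, so every record is read by exactly one split.
    public static List<String> readSplit(byte[] data, long offset, long end) {
        int pos = (int) offset;
        if (offset > 0) {
            // Back up one byte and skip through the next newline. This drops
            // the partial record owned by the previous split, but keeps a
            // record that starts exactly at 'offset'.
            pos = (int) offset - 1;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }

        List<String> records = new ArrayList<>();
        // The 'pos < end' check is the important part: without it the reader
        // would run on to the end of the file instead of the end of the split.
        while (pos < end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return records;
    }
}
```

With this convention, reading the splits [0, 5) and [5, 14) of the 14-byte input "a\nbb\nccc\ndddd\n" yields the records a, bb and then ccc, dddd, with nothing duplicated or dropped. A record that straddles a boundary is read in full by the split where it starts, so a reader may legitimately run slightly past its end offset, but only to finish that one record.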