pig-user  

Map tasks don't complete

Raghu Rajagopalan
Tue, 01 Jul 2008 12:13:36 -0700

Hi,
I wrote a small Pig script with a couple of custom functions, and it
works fine in local mode.
However, when I run it on a Hadoop cluster against a 4 GB file (an
Apache access log), the job is submitted successfully and the input is
split into 66 map tasks (64 MB chunk size). On my cluster of 10
machines, the first 10 maps start, but they never seem to terminate
(progress goes to 1200% on the Hadoop map-reduce tasks). I don't see
anything untoward in the logs either.

On the command line, Pig's progress indicator keeps printing indefinitely.

The Pig script and the functions it references are attached. I'm
wondering if anyone has seen anything similar, and what steps might be
needed to fix this.

CsvLogStorage.java - load function that uses opencsv to parse the Apache log
REGEX.java - regex splitter that outputs a tuple for a given regex
SPLITDATE.java - parses a date and outputs a tuple with the given date parts

My guess is that there's something wrong with the way the custom load
function is written.
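
To make that guess concrete: one way a load function could produce this
symptom is by ignoring the end of its input split and reading on to the
end of the file, so each map processes far more than its 64 MB share and
reported progress climbs well past 100%. I have not confirmed that this
is what CsvLogStorage does. The sketch below is a standalone
illustration in plain Java, not the attached code and not Pig's load
function API; the class SplitBoundedReader and its methods are
hypothetical names. It shows the usual convention: a reader that does
not start at offset 0 discards its first (possibly partial) line, and
every reader stops once its position has passed the end of its split.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone illustration; not Pig's load function API.
class SplitBoundedReader {
    private final ByteArrayInputStream in;
    private final long end;   // the split covers bytes [offset, end)
    private long pos;         // absolute position in the "file"

    SplitBoundedReader(byte[] data, long offset, long end) {
        this.in = new ByteArrayInputStream(data);
        this.end = end;
        this.pos = in.skip(offset);
        // A reader that starts mid-file drops its first line: the previous
        // split reads any line that begins at or before its own end.
        if (offset > 0) {
            readLine();
        }
    }

    // Reads up to and including the next '\n'; returns null at end of data.
    private String readLine() {
        StringBuilder sb = new StringBuilder();
        boolean sawAny = false;
        int b;
        while ((b = in.read()) != -1) {
            pos++;
            sawAny = true;
            if (b == '\n') {
                break;
            }
            sb.append((char) b);
        }
        return sawAny ? sb.toString() : null;
    }

    // Returns the next record, or null once the split is exhausted.
    String next() {
        if (pos > end) {   // without this check the reader runs to end of file
            return null;
        }
        return readLine();
    }

    // Collects every record that belongs to the split [offset, end).
    static List<String> readSplit(byte[] data, long offset, long end) {
        SplitBoundedReader r = new SplitBoundedReader(data, offset, end);
        List<String> out = new ArrayList<>();
        for (String line = r.next(); line != null; line = r.next()) {
            out.add(line);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "a\nbb\nccc\ndddd\n".getBytes(StandardCharsets.US_ASCII);
        // Three 5-byte splits over a 14-byte "file".
        for (long off = 0; off < data.length; off += 5) {
            long end = Math.min(off + 5, data.length);
            System.out.println(off + "-" + end + ": " + readSplit(data, off, end));
        }
    }
}
```

With these two rules each line is read by exactly one split, even when
a line straddles a split boundary. If the position check in next() is
removed, every split after the first keeps reading to the end of the
data.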

My setup:
Hadoop 0.17
Pig.jar from the pigtutorial.tar.gz on the wiki.

Thanks for looking.
Raghu

Attachment: pigfunctions-src.tgz
Description: GNU Zip compressed data