Hello fellow Pig users.  I am new to both Hadoop and Pig, with a
background in relational databases and Perl scripting.

Yesterday I ran a fairly simple Pig script on our new 10-node cluster.
It processed approximately 630 GB of raw data and finished in around 45
minutes.  Within 1 to 2 minutes of submitting the job, I could see the
map/reduce processes running on the data node machines, and the % done
count began to increment in the grunt shell.

Today, I ran the exact same Pig script
against the exact same dataset.  However, this time I saw no activity on
the data nodes for over 50 minutes.  The script sat at 0% complete for
those 50 minutes before I finally saw processes start on the data nodes.
From that point, the script completed in around 45 minutes, just as it did
the day before.  I am the only user of the system, and no other jobs
were running at the time.

I have also noticed that a simple 'ls' on a directory from grunt takes
far longer (as in many orders of magnitude) to return the list of files
than 'hadoop fs -ls' on the same directory.
 
The only thing that changed between yesterday and today was that I loaded
additional data into HDFS (another 680 GB), but that data was not
processed by the Pig script in question, as it was loaded into a
different directory path.  I have seen this same 'sit and wait' behavior
from pig on a 4 node test cluster I was using prior to the current
cluster.  Any ideas what is going on here?  I am using Hadoop 0.20.1 and
Pig 0.5.

PS.  My emails keep getting rejected by the list server as 'spam', and I
have to keep editing them until one finally goes through.  Can anything
be done about that?



