Hello fellow Pig users,

I am new to both Hadoop and Pig, with a background in relational databases and Perl scripting.

Yesterday I ran a fairly simple Pig script on our new 10-node cluster. It processed approximately 630 GB of raw data and finished in around 45 minutes. About 1 to 2 minutes after I submitted the job, I could see the map/reduce processes running on the data node machines, and the % done count began to increment in the grunt shell.

Today I ran the exact same Pig script against the exact same dataset. This time, however, I saw no activity on the data nodes for over 50 minutes. The script sat at 0% complete for those 50 minutes before I finally saw processes on the data nodes. From that point, the script completed in around 45 minutes, just as it did the day before. I am the only user of the system, and no other jobs were running at the time.

I have also noticed that a simple 'ls' on a directory from grunt takes much, much longer (as in many orders of magnitude) to return the list of files than 'hadoop fs -ls' on the same directory.

The only thing that changed between yesterday and today is that I loaded additional data into HDFS (another 680 GB). That data was not processed by the Pig script in question, since it was loaded into a different directory path. I have seen this same 'sit and wait' behavior from Pig on a 4-node test cluster I was using prior to the current cluster.

Any ideas what is going on here? I am using Hadoop 0.20.1 and Pig 0.5.

P.S. My emails keep getting rejected by the list server as 'spam', and I have to keep editing them until one finally goes through. Can anything be done about that?
