Classification: UNCLASSIFIED
Caveats: NONE

Not sure I can figure out the correlation you're looking for, but I'll try. I do note that when the reduce tasks appear to stall, the log's last entries look something like this:

INFO: org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 363426183 bytes
INFO: org.apache.pig.impl.util.SpillableMemoryManager: low memory handler called (Usage threshold exceeded) init=5439488 (5312K) used=195762912 (191174K) committed=225574912 (220288K) max=279642112 (273088K)

Sometimes the last message is repeated multiple times, sometimes not.

Thanks,
Robert
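A note on that last log line, for anyone hitting the same stall: the SpillableMemoryManager message means Pig's low-memory handler fired because heap usage crossed the JMX usage threshold Pig registers on, so Pig is spilling its registered bags to disk. Seeing it repeat during the final merge pass suggests the reducer is alternating between merging and spilling rather than making progress. A minimal sketch of the knobs involved, assuming the property names read by Pig's SpillableMemoryManager in this era (verify names and defaults against your release; the values here are illustrative, in bytes):

    # pig.properties -- a sketch, not verified against every Pig release.
    # Bags smaller than this are not considered worth spilling individually.
    pig.spill.size.threshold=5000000
    # Above this heap usage, ask for a GC before spilling, so bags that are
    # already collectable are not needlessly written to disk.
    pig.spill.gc.activation.size=40000000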
-----Original Message-----
From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
Sent: Thursday, March 11, 2010 2:50 PM
To: pig-user@hadoop.apache.org
Subject: Re: Reducers slowing down? (UNCLASSIFIED)

Can you check the task logs and see how the number of databag spills to disk correlates with the number of tuples/bytes processed and the time a task took? It sounds like there is some terrible skew going on, although CROSS really shouldn't have that problem if it does what I think it should be doing (which is probably wrong; I never use CROSS).

-D

2010/3/11 Winkler, Robert (Civ, ARL/CISD) <robert.wink...@us.army.mil>
> Classification: UNCLASSIFIED
> Caveats: NONE
>
> Yeah, that didn't work either. It ran for 3 days and then failed because
> of "too many fetch failures". It seems to get about 2/3 of the way
> through the reducers (regardless of their number) reasonably quickly and
> then just stalls or fails.
>
> Anyway, I changed the script to SPLIT the People dataset into 26
> subsets based on whether the first character matched a-z, and crossed
> each of those subsets with the Actors relation. This resulted in 26
> separate Pig jobs running in parallel (I went back to PARALLEL 30, so
> each had 30 reducers).
>
> That worked. The shortest job took 53 minutes and the longest 22.5 hours.
> But I'm not sure what to make of this, other than that I shouldn't try to
> process a 500,000,000,000-tuple relation.
>
> -- Register CMU's SecondString
> REGISTER /home/arl/Desktop/ARLDeveloper/JavaCOTS/SecondString/secondstring-20060615.jar;
> -- Register ARL's UDF SecondString wrapper
> REGISTER /home/arl/Desktop/ARLDeveloper/JavaComponents/INSCOM/CandidateIdentification.jar;
> -- |People| ~ 62,500,000
> People = LOAD '/data/UniquePeoplePerStory' USING PigStorage(',') AS
>     (file:chararray, name:chararray);
> -- Split People based on first character
> SPLIT People INTO A IF name MATCHES '^[aA].*', …. , Z IF name MATCHES '^[zZ].*';
> -- |Actors| ~ 8,000
> Actors = LOAD '/data/Actors' USING PigStorage(',') AS (actor:chararray);
> -- Process each split in parallel
> ToCompareA = CROSS Actors, A PARALLEL 30;
> AResults = FOREACH ToCompareA GENERATE $0, $1, $2,
>     ARL.CandidateIdentificationUDF.Similarity($2, $0);
> STORE AResults INTO '/data/ScoredPeople/A' USING PigStorage(',');
> …
> ToCompareZ = CROSS Actors, Z PARALLEL 30;
> ZResults = FOREACH ToCompareZ GENERATE $0, $1, $2,
>     ARL.CandidateIdentificationUDF.Similarity($2, $0);
> STORE ZResults INTO '/data/ScoredPeople/Z' USING PigStorage(',');
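One nit worth flagging on the SPLIT approach quoted above: SPLIT branches are independent filters, so any name whose first character is not a letter matches no branch and silently vanishes from the output. A minimal sketch of a catch-all branch that keeps those stragglers (only two of the 26 letter branches are shown, and the Other alias is illustrative, not from Robert's script):

    SPLIT People INTO
        A IF name MATCHES '^[aA].*',
        -- ... 24 more letter branches here ...
        Z IF name MATCHES '^[zZ].*',
        Other IF NOT (name MATCHES '^[a-zA-Z].*');
    -- Other can then be CROSSed with Actors like any letter branch,
    -- or simply stored off for inspection.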
> -----Original Message-----
> From: Mridul Muralidharan [mailto:mrid...@yahoo-inc.com]
> Sent: Friday, March 05, 2010 9:39 PM
> To: pig-user@hadoop.apache.org
> Cc: Thejas Nair; Winkler, Robert (Civ, ARL/CISD)
> Subject: Re: Reducers slowing down? (UNCLASSIFIED)
>
> On Saturday 06 March 2010 04:47 AM, Thejas Nair wrote:
> > I am not sure why the rate at which output is generated is slowing down.
> > But CROSS in Pig is not optimized: it uses only one reducer. (A major
> > limitation if you are trying to process lots of data with a large
> > cluster!)
>
> CROSS is not supposed to use a single reducer - GRCross is parallel in
> Pig, last time we checked (a while back, though).
> That it is parallel does not mean it is not expensive; it is still
> pretty darn expensive.
>
> Given this, the suggestion below might not work?
>
> Robert, what about using a higher value of PARALLEL for CROSS? (Much
> higher than the number of nodes, if required.)
>
> Regards,
> Mridul
>
> > You can try using a skewed join instead: project a constant in both
> > streams and then join on that.
> >
> > ToCompare = join Actors by 1, People by 1 using 'skewed' PARALLEL 30;
> >
> > I haven't tried this on a very large dataset; I am interested in
> > knowing how this compares if you try it out.
> >
> > -Thejas
> >
> > On 3/5/10 9:48 AM, "Winkler, Robert (Civ, ARL/CISD)"
> > <robert.wink...@us.army.mil> wrote:
> >
> >> Classification: UNCLASSIFIED
> >> Caveats: NONE
> >>
> >> Hello, I'm using Pig 0.6.0, running the following script on a
> >> 27-datanode cluster running RedHat Enterprise 5.4:
> >>
> >> -- Holds the Pig UDF wrapper around the SecondString SoftTFIDF function
> >> REGISTER /home/CandidateIdentification.jar;
> >> -- SecondString itself
> >> REGISTER /home/secondstring-20060615.jar;
> >> -- |People| ~ 62,500,000 from the English GigaWord 4th Edition
> >> People = LOAD '/data/UniquePeoplePerStory' USING PigStorage(',') AS
> >>     (file:chararray, name:chararray);
> >> -- |Actors| ~ 8,000 from the Stanford Movie Database
> >> Actors = LOAD '/data/Actors' USING PigStorage(',') AS (actor:chararray);
> >> -- |ToCompare| ~ 500,000,000,000
> >> ToCompare = CROSS Actors, People PARALLEL 30;
> >> -- Score 'em and store 'em
> >> Results = FOREACH ToCompare GENERATE $0, $1, $2,
> >>     ARL.CandidateIdentificationUDF.Similarity($2, $0);
> >> STORE Results INTO '/data/ScoredPeople' USING PigStorage(',');
> >>
> >> The first 100,000,000,000 reduce output records were produced in some
> >> 25 hours. But after 75 hours it has produced a total of
> >> 140,000,000,000 (instead of the 300,000,000,000 I was extrapolating)
> >> and seems to be producing them at a slower and slower rate. What is
> >> going on? Did I screw something up?
> >>
> >> Thanks,
> >>
> >> Robert
> >>
> >> Classification: UNCLASSIFIED
> >> Caveats: NONE
>
> Classification: UNCLASSIFIED
> Caveats: NONE

Classification: UNCLASSIFIED
Caveats: NONE
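To spell out the skewed-join rewrite Thejas suggests in the quoted message: project a constant key into both relations, then join on it, so the join emits the same pairs as the CROSS while the skewed-join sampler spreads the single hot key across reducers. A minimal sketch (the k alias and the disambiguated projections are illustrative, not from the thread; worth verifying on your Pig version before committing to a multi-day run):

    -- Tag every tuple in each relation with the same constant key.
    ActorsK = FOREACH Actors GENERATE 1 AS k, actor;
    PeopleK = FOREACH People GENERATE 1 AS k, file, name;
    -- Every tuple shares key 1, so this join is the full cross product;
    -- 'skewed' lets Pig split that one hot key across the 30 reducers.
    ToCompare = JOIN ActorsK BY k, PeopleK BY k USING 'skewed' PARALLEL 30;
    Results = FOREACH ToCompare GENERATE
        ActorsK::actor, PeopleK::file, PeopleK::name,
        ARL.CandidateIdentificationUDF.Similarity(PeopleK::name, ActorsK::actor);
    STORE Results INTO '/data/ScoredPeople' USING PigStorage(',');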