It turned out to be a deployment issue: an old version had been deployed. Ted and Chris's suggestions were spot-on.
I can't believe how BRILLIANT these combiners from Cascading are. It's cut my processing time down from 20 hours to 50 minutes. AND I cut out about 80% of my hand-crafted code. Bravo. I look smart now. (Almost).

-B

On Sun, Sep 26, 2010 at 7:00 PM, Ted Dunning <[email protected]> wrote:

> If there are combiners, the reducers shouldn't get any lists longer than a
> small multiple of the number of maps.
>
> On Sun, Sep 26, 2010 at 6:01 PM, Bradford Stephens <[email protected]> wrote:
>
>> One of the problems with this data set is that I'm grouping by a
>> category that has only, say, 20 different values. Then I'm doing a
>> unique count of Facebook user IDs per group. I imagine that's not
>> pleasant for the reducers.
>>
>> On Sun, Sep 26, 2010 at 5:41 PM, Alex Kozlov <[email protected]> wrote:
>> > Hi Bradford,
>> >
>> > Sometimes the reducers do not handle merging large chunks of data too well:
>> > How many reducers do you have? Try to increase the # of reducers (you can
>> > always merge the data later if you are worried about too many output files).
>> >
>> > --
>> > Alex Kozlov
>> > Solutions Architect
>> > Cloudera, Inc
>> > twitter: alexvk2009
>> >
>> > Hadoop World 2010, October 12, New York City - Register now:
>> > http://www.cloudera.com/company/press-center/hadoop-world-nyc/
>> >
>> > On Sun, Sep 26, 2010 at 5:09 PM, Chris K Wensel <[email protected]> wrote:
>> >
>> >> Try using a lower threshold value (the num of values in the LRU to cache).
>> >> This is the tradeoff of this approach.
>> >>
>> >> ckw
>> >>
>> >> On Sep 26, 2010, at 4:46 PM, Bradford Stephens wrote:
>> >>
>> >> > Sadly, making Chris's changes didn't help.
>> >> >
>> >> > Here's the Cascading code, it's pretty simple but uses the new
>> >> > "combiner"-like functionality:
>> >> >
>> >> > http://pastebin.com/ccvDmLSX
>> >> >
>> >> > On Sun, Sep 26, 2010 at 9:37 AM, Ted Dunning <[email protected]> wrote:
>> >> >> My feeling is that you have some kind of leak going on in your mappers or
>> >> >> reducers and that reducing the number of times the jvm is re-used would
>> >> >> improve matters.
>> >> >>
>> >> >> GC overhead limit indicates that your (tiny) heap is full and collection is
>> >> >> not reducing that.
>> >> >>
>> >> >> On Sun, Sep 26, 2010 at 12:55 AM, Bradford Stephens <[email protected]> wrote:
>> >> >>
>> >> >>> mapred.job.reuse.jvm.num.tasks=50
>> >> >
>> >> > --
>> >> > Bradford Stephens,
>> >> > Founder, Drawn to Scale
>> >> > drawntoscalehq.com
>> >> > 727.697.7528
>> >> >
>> >> > http://www.drawntoscalehq.com -- The intuitive, cloud-scale data
>> >> > solution. Process, store, query, search, and serve all your data.
>> >> >
>> >> > http://www.roadtofailure.com -- The Fringes of Scalability, Social
>> >> > Media, and Computer Science
>> >> >
>> >> > --
>> >> > You received this message because you are subscribed to the Google Groups
>> >> > "cascading-user" group.
>> >> > To post to this group, send email to [email protected].
>> >> > To unsubscribe from this group, send email to [email protected].
>> >> > For more options, visit this group at
>> >> > http://groups.google.com/group/cascading-user?hl=en.
>> >>
>> >> --
>> >> Chris K Wensel
>> >> [email protected]
>> >> http://www.concurrentinc.com
>> >>
>> >> -- Concurrent, Inc. offers mentoring, support, and licensing for Cascading
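
The idea discussed in this thread (a unique count of user IDs per group, with combiner-like de-duplication on the map side bounded by an LRU threshold) can be sketched outside Cascading. The Python below is only an illustration under stated assumptions: it is not the code from the pastebin link and not Cascading's API, and `partial_unique`, `reduce_unique_counts`, and the sample records are hypothetical names and made-up data.

```python
from collections import OrderedDict, defaultdict


def partial_unique(pairs, threshold):
    """Map-side de-duplication of (group, user_id) pairs.

    Keeps at most `threshold` pairs in an LRU cache so memory stays bounded.
    A pair that was evicted and shows up again is emitted a second time;
    the reduce side removes those leftover duplicates.
    """
    cache = OrderedDict()
    for pair in pairs:
        if pair in cache:
            cache.move_to_end(pair)  # mark as recently used, emit nothing
            continue
        cache[pair] = True
        if len(cache) > threshold:
            cache.popitem(last=False)  # evict the least recently used pair
        yield pair


def reduce_unique_counts(partials):
    """Reduce side: exact distinct count of user IDs per group."""
    seen = defaultdict(set)
    for group, user_id in partials:
        seen[group].add(user_id)
    return {group: len(ids) for group, ids in seen.items()}


# Two hypothetical map tasks with made-up (category, user_id) records.
map_inputs = [
    [("music", 1), ("music", 1), ("sports", 2), ("music", 1)],
    [("music", 1), ("sports", 2), ("sports", 3), ("sports", 2)],
]

# Each map task de-duplicates its own input before the "shuffle".
shuffled = [
    pair
    for task in map_inputs
    for pair in partial_unique(task, threshold=2)
]
counts = reduce_unique_counts(shuffled)
print(len(shuffled), counts)
```

With 8 input records, only 5 pairs reach the reduce step here, and the counts stay exact. A lower `threshold` uses less memory per map task but lets more duplicates through, which is the tradeoff Chris describes.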
