Thanks, Josh. That worked perfectly! It has the added benefit of dramatically improving the startup time, I assume because we're no longer copying the monstrous jobconfs around.
-- John
________________________________________
From: Josh Wills (JIRA) [[email protected]]
Sent: Wednesday, May 22, 2013 5:27 PM
To: [email protected]
Subject: [jira] [Updated] (CRUNCH-209) Jobs with large numbers of directory inputs will fail with odd inputsplit exceptions

     [ https://issues.apache.org/jira/browse/CRUNCH-209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Wills updated CRUNCH-209:
------------------------------

    Attachment: CRUNCH-209.patch

A hypothetical fix for John to test out.

> Jobs with large numbers of directory inputs will fail with odd inputsplit exceptions
> ------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-209
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-209
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5.0, 0.6.0
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-209.patch
>
>
> From John Jensen on the user mailing list:
> I have a curious problem when running a crunch job on (avro) files in a fairly large set of directories (just slightly less than 100).
> After running some fraction of the mappers they start failing with the exception below. Things work fine with a smaller number of directories.
> The magic 'zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI' string shows up in the 'crunch.inputs.dir' entry in the job config, so I assume it has something to do with deserializing that value, but reading through the code I don't see any obvious way how.
> Furthermore, the crunch.inputs.dir config entry is just under 1.5M, so it would not surprise me if I'm running up against a hadoop limit somewhere.
> Stack trace:
> java.io.IOException: Split class zdHJp
> bmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not found
>     at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:342)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:614)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>     at org.apache.hadoop.mapred.Child.main(Child.java:262)
> Caused by: java.lang.ClassNotFoundException: Class zdHJp
> bmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not found
>     at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
>     at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:340)
>     ... 7 more

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
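
For anyone reading this thread later: the "magic" string in the report looks like a slice of Base64 text, which would fit John's guess that it comes out of the serialized 'crunch.inputs.dir' value. The sketch below is only an illustration and is not taken from Crunch or from the patch. The class and method names (MagicStringDecoder, decodeFragment) are made up for this example, and the skip of 1 is an assumption: it is the alignment that happens to land on a 4-character Base64 group boundary for this particular fragment.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of Crunch or Hadoop.
class MagicStringDecoder {
    // The fragment quoted in the report, copied verbatim.
    static final String FRAGMENT =
        "zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI";

    // Drops `skip` leading characters so the remainder starts on a
    // 4-character Base64 group, pads the tail, and decodes it as UTF-8.
    static String decodeFragment(String fragment, int skip) {
        String aligned = fragment.substring(skip);
        int rem = aligned.length() % 4;
        if (rem == 1) {
            // A single leftover character carries no complete byte; drop it.
            aligned = aligned.substring(0, aligned.length() - 1);
            rem = 0;
        }
        StringBuilder padded = new StringBuilder(aligned);
        for (int i = rem; i != 0 && i < 4; i++) {
            padded.append('=');
        }
        byte[] bytes = Base64.getDecoder().decode(padded.toString());
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumption: the fragment begins one character before a group boundary.
        System.out.println(decodeFragment(FRAGMENT, 1));
    }
}
```

With a skip of 1 this prints tring"},{"name":"value","type":"string"}]}},"default" which reads like the tail end of an Avro schema in JSON. That suggests, without proving it, that a chunk of the encoded config value was being read back where the split class name was expected.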
