Glorious. That had been on my TODO list for a while; I'm glad we found a problem that forced me to fix it. ;-) Will commit to master. We should also probably consider a point release (0.6.1) with that fix, esp. due to the startup improvements.
J

On Thu, May 23, 2013 at 11:00 AM, John Jensen <[email protected]> wrote:

>
> Thanks, Josh. That worked perfectly!
>
> It has the added benefit of dramatically improving the startup time. I
> assume because we're no longer copying the monstrous jobconfs around.
>
> -- John
>
> ________________________________________
> From: Josh Wills (JIRA) [[email protected]]
> Sent: Wednesday, May 22, 2013 5:27 PM
> To: [email protected]
> Subject: [jira] [Updated] (CRUNCH-209) Jobs with large numbers of directory inputs will fail with odd inputsplit exceptions
>
> [https://issues.apache.org/jira/browse/CRUNCH-209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Josh Wills updated CRUNCH-209:
> ------------------------------
>
> Attachment: CRUNCH-209.patch
>
> A hypothetical fix for John to test out.
>
> > Jobs with large numbers of directory inputs will fail with odd inputsplit exceptions
> > ------------------------------------------------------------------------------------
> >
> > Key: CRUNCH-209
> > URL: https://issues.apache.org/jira/browse/CRUNCH-209
> > Project: Crunch
> > Issue Type: Bug
> > Components: Core
> > Affects Versions: 0.5.0, 0.6.0
> > Reporter: Josh Wills
> > Assignee: Josh Wills
> > Attachments: CRUNCH-209.patch
> >
> >
> > From John Jensen on the user mailing list:
> > I have a curious problem when running a crunch job on (avro) files in a fairly large set of directories (just slightly less than 100).
> > After running some fraction of the mappers they start failing with the exception below. Things work fine with a smaller number of directories.
> > The magic 'zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI' string shows up in the 'crunch.inputs.dir' entry in the job config, so I assume it has something to do with deserializing that value, but reading through the code I don't see any obvious way how.
> > Furthermore, the crunch.inputs.dir config entry is just under 1.5M, so it would not surprise me if I'm running up against a hadoop limit somewhere.
> > Stack trace:
> > java.io.IOException: Split class zdHJp
> > bmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not found
> >     at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:342)
> >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:614)
> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
> >     at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at javax.security.auth.Subject.doAs(Subject.java:415)
> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
> >     at org.apache.hadoop.mapred.Child.main(Child.java:262)
> > Caused by: java.lang.ClassNotFoundException: Class zdHJp
> > bmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not found
> >     at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
> >     at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:340)
> >     ... 7 more
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA administrators
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
