Yep, definitely looks like an improvement! What was the actual cause of John's issue in the first place? Is there a hard limit (or a bug) in the serialization of Configuration values?
- Gabriel

On 23 May 2013, at 20:26, Josh Wills <[email protected]> wrote:

> Glorious. That had been on my TODO list for awhile, I'm glad we found a
> problem that forced me to fix it. ;-) Will commit to master. We should also
> probably consider a point release (0.6.1) with that fix, esp. due to the
> startup improvements.
>
> J
>
>
> On Thu, May 23, 2013 at 11:00 AM, John Jensen <[email protected]> wrote:
>
>>
>> Thanks, Josh. That worked perfectly!
>>
>> It has the added benefit of dramatically improving the startup time. I
>> assume because we're no longer copying the monstrous jobconfs around.
>>
>> -- John
>>
>> ________________________________________
>> From: Josh Wills (JIRA) [[email protected]]
>> Sent: Wednesday, May 22, 2013 5:27 PM
>> To: [email protected]
>> Subject: [jira] [Updated] (CRUNCH-209) Jobs with large numbers of
>> directory inputs will fail with odd inputsplit exceptions
>>
>> [
>> https://issues.apache.org/jira/browse/CRUNCH-209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>>
>> Josh Wills updated CRUNCH-209:
>> ------------------------------
>>
>> Attachment: CRUNCH-209.patch
>>
>> A hypothetical fix for John to test out.
>>
>>> Jobs with large numbers of directory inputs will fail with odd inputsplit exceptions
>>> ------------------------------------------------------------------------------------
>>>
>>> Key: CRUNCH-209
>>> URL: https://issues.apache.org/jira/browse/CRUNCH-209
>>> Project: Crunch
>>> Issue Type: Bug
>>> Components: Core
>>> Affects Versions: 0.5.0, 0.6.0
>>> Reporter: Josh Wills
>>> Assignee: Josh Wills
>>> Attachments: CRUNCH-209.patch
>>>
>>>
>>> From John Jensen on the user mailing list:
>>> I have a curious problem when running a crunch job on (avro) files in a
>>> fairly large set of directories (just slightly less than 100).
>>> After running some fraction of the mappers they start failing with the
>>> exception below. Things work fine with a smaller number of directories.
>>> The magic 'zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI'
>>> string shows up in the 'crunch.inputs.dir' entry in the job config, so I
>>> assume it has something to do with deserializing that value, but reading
>>> through the code I don't see any obvious way how.
>>> Furthermore, the crunch.inputs.dir config entry is just under 1.5M, so
>>> it would not surprise me if I'm running up against a hadoop limit somewhere.
>>> Stack trace:
>>> java.io.IOException: Split class zdHJp
>>> bmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not found
>>>     at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:342)
>>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:614)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
>>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>     at javax.security.auth.Subject.doAs(Subject.java:415)
>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>>>     at org.apache.hadoop.mapred.Child.main(Child.java:262)
>>> Caused by: java.lang.ClassNotFoundException: Class zdHJp
>>> bmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not found
>>>     at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
>>>     at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:340)
>>>     ... 7 more
>>
>> --
>> This message is automatically generated by JIRA.
>> If you think it was sent incorrectly, please contact your JIRA
>> administrators
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
