There is a limit in MR1, mapred.user.jobconf.limit (see http://hadoop.apache.org/docs/stable/mapred-default.html), which caps the jobconf at 5 MB, though note that it is enforced at the JobTracker level. I am not aware of any serialization-time limits, and I believe there are none, as I've seen Hive use the same code to write enormous files.
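If it helps to sanity-check this on a live cluster, here is a minimal sketch (my own illustration, not from this thread; the class name is hypothetical) that prints how large the crunch.inputs.dir entry has grown relative to that JobTracker-side cap:

    import org.apache.hadoop.conf.Configuration;

    // Hypothetical helper: prints the size of the entry John saw grow to
    // ~1.5 MB, plus the MR1 JobTracker-side cap it is measured against.
    public class JobConfSizeCheck {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        String inputs = conf.get("crunch.inputs.dir", "");
        System.out.printf("crunch.inputs.dir: %d chars%n", inputs.length());
        // mapred.user.jobconf.limit defaults to 5242880 (5 MB) in MR1 and
        // can be raised in mapred-site.xml on the JobTracker.
        long limit = conf.getLong("mapred.user.jobconf.limit", 5242880L);
        System.out.printf("mapred.user.jobconf.limit: %d bytes%n", limit);
      }
    }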
Worth noting that MR2, in keeping with its JobTracker-less architecture, has no such limit on jobconf size, and the property isn't present in it anymore.

On Fri, May 24, 2013 at 12:13 AM, Josh Wills <[email protected]> wrote:

> On Thu, May 23, 2013 at 11:39 AM, Gabriel Reid <[email protected]> wrote:
>
> > Yep, definitely looks like an improvement!
> >
> > What was the actual cause of John's issue in the beginning? Is there a
> > physical limit (or bug) in the serialization of Configuration values?
>
> It seems like there must be, although I couldn't figure out where it was
> happening exactly, and Googling around for limits about jobconf
> serialization didn't turn up anything, either.
>
> > - Gabriel
> >
> > On 23 May 2013, at 20:26, Josh Wills <[email protected]> wrote:
> >
> > > Glorious. That had been on my TODO list for awhile, I'm glad we found a
> > > problem that forced me to fix it. ;-) Will commit to master. We should
> > > also probably consider a point release (0.6.1) with that fix, esp. due
> > > to the startup improvements.
> > >
> > > J
> > >
> > > On Thu, May 23, 2013 at 11:00 AM, John Jensen <[email protected]> wrote:
> > >
> > >> Thanks, Josh. That worked perfectly!
> > >>
> > >> It has the added benefit of dramatically improving the startup time. I
> > >> assume because we're no longer copying the monstrous jobconfs around.
> > >>
> > >> -- John
> > >>
> > >> ________________________________________
> > >> From: Josh Wills (JIRA) [[email protected]]
> > >> Sent: Wednesday, May 22, 2013 5:27 PM
> > >> To: [email protected]
> > >> Subject: [jira] [Updated] (CRUNCH-209) Jobs with large numbers of
> > >> directory inputs will fail with odd inputsplit exceptions
> > >>
> > >> [ https://issues.apache.org/jira/browse/CRUNCH-209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> > >>
> > >> Josh Wills updated CRUNCH-209:
> > >> ------------------------------
> > >>
> > >> Attachment: CRUNCH-209.patch
> > >>
> > >> A hypothetical fix for John to test out.
> > >>
> > >>> Jobs with large numbers of directory inputs will fail with odd
> > >>> inputsplit exceptions
> > >>> ------------------------------------------------------------------------------------
> > >>>
> > >>> Key: CRUNCH-209
> > >>> URL: https://issues.apache.org/jira/browse/CRUNCH-209
> > >>> Project: Crunch
> > >>> Issue Type: Bug
> > >>> Components: Core
> > >>> Affects Versions: 0.5.0, 0.6.0
> > >>> Reporter: Josh Wills
> > >>> Assignee: Josh Wills
> > >>> Attachments: CRUNCH-209.patch
> > >>>
> > >>> From John Jensen on the user mailing list:
> > >>> I have a curious problem when running a crunch job on (avro) files in a
> > >>> fairly large set of directories (just slightly less than 100).
> > >>> After running some fraction of the mappers they start failing with the
> > >>> exception below. Things work fine with a smaller number of directories.
> > >>> The magic
> > >>> 'zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI'
> > >>> string shows up in the 'crunch.inputs.dir' entry in the job config, so I
> > >>> assume it has something to do with deserializing that value, but reading
> > >>> through the code I don't see any obvious way how.
> > >>> Furthermore, the crunch.inputs.dir config entry is just under 1.5M, so
> > >>> it would not surprise me if I'm running up against a hadoop limit
> > >>> somewhere.
> > >>> Stack trace:
> > >>> java.io.IOException: Split class zdHJp
> > >>> bmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not
> > >>> found
> > >>>     at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:342)
> > >>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:614)
> > >>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
> > >>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
> > >>>     at java.security.AccessController.doPrivileged(Native Method)
> > >>>     at javax.security.auth.Subject.doAs(Subject.java:415)
> > >>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
> > >>>     at org.apache.hadoop.mapred.Child.main(Child.java:262)
> > >>> Caused by: java.lang.ClassNotFoundException: Class zdHJp
> > >>> bmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not
> > >>> found
> > >>>     at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
> > >>>     at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:340)
> > >>>     ... 7 more
> > >>
> > >> --
> > >> This message is automatically generated by JIRA.
> > >> If you think it was sent incorrectly, please contact your JIRA
> > >> administrators
> > >> For more information on JIRA, see: http://www.atlassian.com/software/jira
> > >>
> > >
> > >
> > > --
> > > Director of Data Science
> > > Cloudera <http://www.cloudera.com>
> > > Twitter: @josh_wills <http://twitter.com/josh_wills>
> >

--
Harsh J
