[
https://issues.apache.org/jira/browse/CRUNCH-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830086#comment-13830086
]
Marshall Bockrath-Vandegrift commented on CRUNCH-209:
-----------------------------------------------------
Thanks for the response. There are certainly options for workarounds, but I
was hoping to get to the bottom of the problem in the first place. The shape
of the error suggests something in the guts of Hadoop is incorrectly
serializing splits beyond a certain size. If that's the case, I'd like to find
the associated MAPREDUCE bug, or file one if it doesn't yet exist. Oh well --
more code-spelunking, it seems.
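To make concrete what "incorrectly serializing splits" would look like: the map task reads each split record as a length-prefixed class name followed by the split bytes, so if the reader's offset into that data is wrong, it deserializes payload bytes as the "class name". The sketch below simulates that with plain java.io (writeUTF/readUTF stand in for Hadoop's actual Text-based format, and the Crunch split class name is used purely for illustration) -- it is a simplified model of the failure mode, not Hadoop's real wire format.

```java
import java.io.*;

public class SplitOffsetDemo {

    // A record shaped like a split-file entry: a length-prefixed class
    // name followed by opaque split bytes. (Simplified sketch; the real
    // reader is MapTask.getSplitDetails, which uses Hadoop's Text.)
    static byte[] writeRecord(String className, byte[] splitBytes) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeUTF(className);
        out.write(splitBytes);
        return bos.toByteArray();
    }

    // Read the split class name starting at the given offset into the record.
    static String readClassName(byte[] record, int offset) throws IOException {
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(record, offset, record.length - offset));
        return in.readUTF();
    }

    public static void main(String[] args) throws IOException {
        // Split payload whose bytes happen to be base64 text, like the
        // serialized schema data in crunch.inputs.dir.
        ByteArrayOutputStream p = new ByteArrayOutputStream();
        new DataOutputStream(p).writeUTF(
            "zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI");
        byte[] record = writeRecord(
            "org.apache.crunch.impl.mr.run.CrunchInputSplit", p.toByteArray());

        // Correct offset: the real class name comes back.
        System.out.println(readClassName(record, 0));

        // Wrong offset (e.g. a bad split-metainfo offset) lands inside the
        // payload, and base64 bytes come back as the "class name" -- producing
        // exactly a "Split class zdHJp... not found" style of failure.
        System.out.println(readClassName(record, 2 + 46));
    }
}
```

Under this model, any off-by-N in the recorded split offsets would surface only once the serialized data grows past some boundary, which fits the "fails beyond a certain number of directories" symptom.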
> Jobs with large numbers of directory inputs will fail with odd inputsplit exceptions
> ------------------------------------------------------------------------------------
>
> Key: CRUNCH-209
> URL: https://issues.apache.org/jira/browse/CRUNCH-209
> Project: Crunch
> Issue Type: Bug
> Components: Core
> Affects Versions: 0.5.0, 0.6.0
> Reporter: Josh Wills
> Assignee: Josh Wills
> Fix For: 0.7.0
>
> Attachments: CRUNCH-209.patch
>
>
> From John Jensen on the user mailing list:
> I have a curious problem when running a crunch job on (avro) files in a
> fairly large set of directories (just slightly less than 100).
> After running some fraction of the mappers they start failing with the
> exception below. Things work fine with a smaller number of directories.
> The magic
> 'zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI'
> string shows up in the 'crunch.inputs.dir' entry in the job config, so I
> assume it has something to do with deserializing that value, but reading
> through the code I don't see any obvious way how.
> Furthermore, the crunch.inputs.dir config entry is just under 1.5M, so it
> would not surprise me if I'm running up against a Hadoop limit somewhere.
> Stack trace:
> java.io.IOException: Split class zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not found
> at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:342)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:614)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
> at org.apache.hadoop.mapred.Child.main(Child.java:262)
> Caused by: java.lang.ClassNotFoundException: Class zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not found
> at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
> at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:340)
> ... 7 more
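Incidentally, the "magic" string itself supports the mis-offset theory: it is a fragment of base64 text cut mid-way through a 4-character group, and dropping its first character restores alignment so it decodes cleanly. The decoded bytes look like part of a JSON (Avro) schema, i.e. the kind of data stored in crunch.inputs.dir -- consistent with split bytes being misread as a class name. A quick check:

```java
import java.util.Base64;

public class MagicStringDecode {
    public static void main(String[] args) {
        // The "class name" from the stack trace: a mid-stream base64 fragment.
        String magic =
            "zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI";
        // Dropping the first character restores 4-character group alignment;
        // the basic decoder accepts the unpadded tail.
        byte[] decoded = Base64.getDecoder().decode(magic.substring(1));
        System.out.println(new String(decoded));
        // -> tring"},{"name":"value","type":"string"}]}},"default"
    }
}
```

The recovered text is a slice of a record schema with string-typed "name"/"value" fields, so whatever produced the bogus class name was reading from inside the serialized schema data, not from a corrupted config value.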
--
This message was sent by Atlassian JIRA
(v6.1#6144)