[
https://issues.apache.org/jira/browse/CRUNCH-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830068#comment-13830068
]
Josh Wills commented on CRUNCH-209:
-----------------------------------
Hmm-- this was a while ago, and my initial fix was a hypothesis that just
happened to work. My hypothesis was that we were bumping up against the limit
on the size of a single entry in the job.xml file, and so the fix was to shrink
the size of our entries by serializing a lot less data. My guess would be that
you are running into a similar issue with Cascading (and that Crunch would hit
it again too for a job with enough input directories).
I don't know a ton about how Cascading works, but if this were happening to you
with Crunch, my recommendation would be to split your input directories across
multiple sources (pipes?) and then union those sources together. Crunch breaks
the directories for different sources into different keys in the job.xml file,
so that would be a slightly hacky way of getting around the per-entry size
limits.
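The mechanism behind that workaround can be sketched generically: if the job configuration enforces a cap on how large any single entry may be, one key holding the serialized specs of all input directories can exceed it, while per-source keys each stay well under. A minimal Python simulation, where the 1 MB cap, the key names, and the JSON+base64 serialization are all hypothetical stand-ins chosen only for illustration (the real job.xml limit and Crunch's wire format differ):

```python
import base64
import json

# Hypothetical per-entry cap, standing in for whatever limit the
# job.xml / jobconf machinery actually enforces.
MAX_ENTRY_BYTES = 1_000_000

def put_entry(conf, key, value):
    """Add one key/value pair to the simulated job config,
    rejecting values over the per-entry cap."""
    if len(value.encode("utf-8")) > MAX_ENTRY_BYTES:
        raise ValueError(f"entry {key!r} exceeds per-entry limit")
    conf[key] = value

def serialize_dirs(dirs):
    """Stand-in for the input-spec serialization: JSON, then base64."""
    payload = json.dumps(dirs).encode("utf-8")
    return base64.b64encode(payload).decode("ascii")

# ~100 directories whose serialized specs carry schema baggage;
# the padding makes each spec large, as Avro schemas tend to be.
dirs = [f"/data/events/day={i:03d}/" + "x" * 20_000 for i in range(100)]

# All directories under a single key: blows past the cap.
conf_single = {}
try:
    put_entry(conf_single, "crunch.inputs.dir", serialize_dirs(dirs))
    single_ok = True
except ValueError:
    single_ok = False

# Workaround: one key per source, each holding a fraction of the dirs.
conf_split = {}
chunks = (dirs[j:j + 10] for j in range(0, len(dirs), 10))
for i, chunk in enumerate(chunks):
    put_entry(conf_split, f"crunch.inputs.dir.source{i}", serialize_dirs(chunk))

print(single_ok, len(conf_split))  # → False 10
```

The split configs succeed because the cap applies per entry, not to the total: ten keys of ~270 KB each pass where one ~2.7 MB key fails, which mirrors why unioning several smaller sources sidesteps the limit.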
> Jobs with large numbers of directory inputs will fail with odd inputsplit
> exceptions
> ------------------------------------------------------------------------------------
>
> Key: CRUNCH-209
> URL: https://issues.apache.org/jira/browse/CRUNCH-209
> Project: Crunch
> Issue Type: Bug
> Components: Core
> Affects Versions: 0.5.0, 0.6.0
> Reporter: Josh Wills
> Assignee: Josh Wills
> Fix For: 0.7.0
>
> Attachments: CRUNCH-209.patch
>
>
> From John Jensen on the user mailing list:
> I have a curious problem when running a crunch job on (avro) files in a
> fairly large set of directories (just slightly less than 100).
> After running some fraction of the mappers they start failing with the
> exception below. Things work fine with a smaller number of directories.
> The magic
> 'zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI'
> string shows up in the 'crunch.inputs.dir' entry in the job config, so I
> assume it has something to do with deserializing that value, but reading
> through the code I don't see any obvious way how.
> Furthermore, the crunch.inputs.dir config entry is just under 1.5M, so it
> would not surprise me if I'm running up against a hadoop limit somewhere.
> Stack trace:
> java.io.IOException: Split class zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not found
> at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:342)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:614)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
> at org.apache.hadoop.mapred.Child.main(Child.java:262)
> Caused by: java.lang.ClassNotFoundException: Class zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not found
> at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
> at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:340)
> ... 7 more
--
This message was sent by Atlassian JIRA
(v6.1#6144)