[
https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482184#comment-16482184
]
feng ye commented on TIKA-2643:
-------------------------------
Hi Ken,
I really appreciate your effort looking into this. The MR (MapReduce)
classpath recorded in the file is:
CLASSPATH=/yarn/nm/usercache/fengye/appcache/application_1526666996363_0001/container_1526666996363_0001_01_000002:job.jar/job.jar:job.jar/classes/:job.jar/lib/*:/yarn/nm/usercache/fengye/appcache/application_1526666996363_0001/container_1526666996363_0001_01_000002/*:/etc/hadoop/conf.cloudera.yarn:/var/run/cloudera-scm-agent/process/992-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hadoop/*:/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hadoop-hdfs/*:/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hadoop-yarn/*:/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hadoop-yarn/lib/*:/var/run/cloudera-scm-agent/process/992-yarn-NODEMANAGER:/opt/epauto/vb015/S1428351/SASEPHome/jars/sasep.jar:/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/bin/../lib/hive/lib/*:/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hive-hcatalog/libexec/../share/hcatalog/*:/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hadoop-mapreduce/*:/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hadoop-mapreduce/lib/*:
This does not seem to include Tika 1.5. Also, when we tried not passing our
own Tika jars to MR, we got a Tika.class not found error.
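Several entries in that classpath are wildcards (for example job.jar/lib/*), so reading the string alone cannot show which Tika jars are really visible. Below is a minimal sketch of a scan that expands the wildcard entries on the node and lists jars whose file name contains a token. The class name ClasspathScan, the helper findJars, and the default token "tika" are made up for illustration; it assumes it is run on the node where the container directories exist.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ClasspathScan {

    /** Returns classpath entries (wildcards expanded to jars) whose file name contains the token. */
    static List<String> findJars(String classpath, String token) {
        List<String> hits = new ArrayList<>();
        String needle = token.toLowerCase(Locale.ROOT);
        for (String entry : classpath.split(File.pathSeparator)) {
            if (entry.isEmpty()) {
                continue;
            }
            if (entry.endsWith("*")) {
                // A trailing "*" means "every jar in this directory".
                File dir = new File(entry.substring(0, entry.length() - 1));
                File[] files = dir.listFiles();
                if (files == null) {
                    continue; // directory missing or unreadable on this machine
                }
                for (File f : files) {
                    String name = f.getName().toLowerCase(Locale.ROOT);
                    if (name.endsWith(".jar") && name.contains(needle)) {
                        hits.add(f.getPath());
                    }
                }
            } else if (new File(entry).getName().toLowerCase(Locale.ROOT).contains(needle)) {
                // Plain entry: match on the last path component only.
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Prefer the CLASSPATH variable (as in the container launch script), else the JVM's own.
        String cp = System.getenv("CLASSPATH");
        if (cp == null) {
            cp = System.getProperty("java.class.path");
        }
        String token = args.length > 0 ? args[0] : "tika"; // hypothetical default token
        for (String jar : findJars(cp, token)) {
            System.out.println(jar);
        }
    }
}
```

Running it with a different token (for instance the name of a suspected dependency) would list every copy of that jar the task can see.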
Yes, the jar conflicts seem to be contributing to the crash. Apparently CDH 5.8
ships these jars out of the box, and they collide with the dependency jars of
my version of Tika. Apparently my user.class.first setting did not take
effect on this cluster.
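One way to check whether a "user classes first" setting took effect is to ask the task JVM which jar a class was actually loaded from. This is a minimal sketch, not something taken from the job itself: the class WhichJar and the helper locate are hypothetical, and the PDFBox class name is only a guess at one dependency that might collide. If the setting referred to above is Hadoop's mapreduce.job.user.classpath.first (an assumption), a location under the CDH parcel directory instead of job.jar/lib would suggest it was not honored.

```java
import java.security.CodeSource;

public class WhichJar {

    /** Describes where the named class was loaded from, or why it could not be located. */
    static String locate(String className) {
        try {
            // Use the context class loader and do not run static initializers.
            ClassLoader loader = Thread.currentThread().getContextClassLoader();
            Class<?> c = Class.forName(className, false, loader);
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // Classes from the JDK itself usually have no code source.
            if (src == null || src.getLocation() == null) {
                return className + " <- (no code source, likely a JDK class)";
            }
            return className + " <- " + src.getLocation();
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " <- not loadable: " + e;
        }
    }

    public static void main(String[] args) {
        // Default names are illustrative; pass any fully qualified class names instead.
        String[] names = args.length > 0
                ? args
                : new String[] {"org.apache.tika.Tika", "org.apache.pdfbox.pdmodel.PDDocument"};
        for (String name : names) {
            System.out.println(locate(name));
        }
    }
}
```

Calling locate(...) from the mapper's setup method and logging the result would show the answer for the real task classpath rather than for a local shell.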
Interestingly, I did not find any trace of the Tika.parseToString call in this
file, even though it was the active method call at the time of the crash.
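If the log does not show the call, another option is to look for it from inside the running JVM. The sketch below uses the standard Thread.getAllStackTraces() call to list threads that currently have a frame in a given class and method. The class StackFilter and the helper threadsIn are hypothetical, and it assumes the check runs in the same JVM as the parse, for example from a watchdog thread started by the mapper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class StackFilter {

    /** Returns the names of live threads that have a frame in the given class and method. */
    static List<String> threadsIn(String className, String methodName) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            for (StackTraceElement frame : e.getValue()) {
                if (frame.getClassName().equals(className)
                        && frame.getMethodName().equals(methodName)) {
                    hits.add(e.getKey().getName());
                    break; // one matching frame per thread is enough
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Reports which threads, if any, are currently inside Tika.parseToString.
        System.out.println(threadsIn("org.apache.tika.Tika", "parseToString"));
    }
}
```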
> Tika call hangs when processes a pdf on Cloudera Hadoop
> -------------------------------------------------------
>
> Key: TIKA-2643
> URL: https://issues.apache.org/jira/browse/TIKA-2643
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.17
> Environment: Cloudera Hadoop 5.8
> Reporter: feng ye
> Priority: Blocker
> Attachments: hang-stdout.txt, hang.zip, hs_err_pid32104.log,
> testJournalParser.pdf
>
>
> Tika.parseToString(InputStream) hangs when called within a MapReduce job to
> process a PDF file on Cloudera Hadoop 5.8 (also observed on 5.4). It can
> process some other PDF files on the same cluster. I am attaching the file,
> the syslog, and the stdout logs. Interestingly, the same file is processed
> without problems on a Hortonworks cluster.
> This issue blocks us from making our Tika-based feature available on
> Cloudera clusters, a major flavor of Hadoop, so your timely attention would be
> very much appreciated.