[ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16481255#comment-16481255 ]
feng ye commented on TIKA-2643:
-------------------------------
It turned out that CDH 5.8 bundles Tika 1.5. Although that jar is not on the
MapReduce classpath, it somehow interferes with the Tika version we deployed
(1.17). The result is an apparent hang of more than 10 minutes in the
Tika.parseToString(InputStream) call, during which the debug log stops at the
line I pointed out in my earlier comment; eventually the JVM for the MR job
crashed. I am attaching the crash log: [^hs_err_pid32104.log]. Could you please
take a look and see whether it offers any clue? Thanks.
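
For anyone trying to confirm a similar conflict, here is a minimal sketch (not
part of the original report) that prints which jar a class is actually loaded
from and lists every copy of it visible to the class loader. The class name
ClasspathProbe and its helper methods are hypothetical; only standard JDK calls
are used, and org.apache.tika.Tika is just the default class name to look up.
The implementation version line may print null if the jar manifest does not
set it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper; run it inside the same JVM as the failing task.
public class ClasspathProbe {

    /** Returns the location a class was loaded from, or a placeholder if unknown. */
    public static String locationOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(no code source; probably a JDK bootstrap class)";
        }
        return source.getLocation().toString();
    }

    /** Lists every location visible to the loader that contains the class file. */
    public static List<String> allCopiesOf(String className, ClassLoader loader) {
        String resource = className.replace('.', '/') + ".class";
        List<String> hits = new ArrayList<>();
        try {
            for (URL url : Collections.list(loader.getResources(resource))) {
                hits.add(url.toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "org.apache.tika.Tika";
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        try {
            // initialize=false: locate the class without running its static initializers
            Class<?> cls = Class.forName(name, false, loader);
            System.out.println("loaded from: " + locationOf(cls));
            Package pkg = cls.getPackage();
            System.out.println("implementation version: "
                    + (pkg == null ? null : pkg.getImplementationVersion()));
        } catch (ClassNotFoundException e) {
            System.out.println(name + " is not on this classpath");
        }
        // More than one line here means several copies of the class are visible.
        for (String hit : allCopiesOf(name, loader)) {
            System.out.println("copy found at: " + hit);
        }
    }
}
```

Calling the same two helpers from the mapper's setup code, rather than from
main, would show what the task JVM sees. If the reported location is a CDH jar
rather than the deployed 1.17 jar, that would be consistent with the
interference described above, though it would not by itself explain the crash.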
> Tika call hangs when processes a pdf on Cloudera Hadoop
> -------------------------------------------------------
>
> Key: TIKA-2643
> URL: https://issues.apache.org/jira/browse/TIKA-2643
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.17
> Environment: Cloudera Hadoop 5.8
> Reporter: feng ye
> Priority: Blocker
> Attachments: hang-stdout.txt, hang.zip, hs_err_pid32104.log,
> testJournalParser.pdf
>
>
> Tika.parseToString(InputStream) hangs when called within a MapReduce job to
> process a PDF file on Cloudera Hadoop 5.8 (also observed on 5.4). It can
> process some other PDF files on the same cluster. I am attaching the file,
> the syslog, and the stdout logs. Interestingly, the same file is processed
> without problems on a Hortonworks cluster.
> This issue blocks us from making our Tika-based feature available on
> Cloudera clusters, a major flavor of Hadoop, so your timely attention would be
> very much appreciated.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)