[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

feng ye (JIRA) Fri, 18 May 2018 15:05:54 -0700

    [ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16481255#comment-16481255 ]


feng ye commented on TIKA-2643:
-------------------------------

It turned out that CDH 5.8 has Tika 1.5 bundled. Although it is not on the 
MapReduce classpath, it somehow interferes with the Tika version we deployed 
(1.17). This leads to an apparent hang of over 10 minutes in the 
Tika.parseToString(InputStream) call, during which the debug log stops at the 
line I pointed out in an earlier comment, and eventually the JVM for the MR 
job crashes. I am attaching the crash log: [^hs_err_pid32104.log]. Would you 
please take a look and see if you can find any clues? Thanks.
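
For anyone who wants to check for the same kind of version conflict, here is 
a minimal sketch of how one could print which jar a class was actually loaded 
from inside the task JVM. The class name ClassOrigin and the describe() 
helper are hypothetical names made up for this illustration and are not part 
of Tika or Hadoop; the sketch uses only standard JDK calls.

```java
import java.net.URL;
import java.security.CodeSource;

// Hypothetical helper for illustration only; not part of Tika or Hadoop.
public class ClassOrigin {

    /** Returns a one-line description of where the given class was loaded from. */
    public static String describe(Class<?> cls) {
        // The code source is null for classes loaded by the bootstrap loader.
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        URL location = (source == null) ? null : source.getLocation();
        // getClassLoader() also returns null for bootstrap classes.
        ClassLoader loader = cls.getClassLoader();
        return cls.getName()
                + " loaded from " + (location == null ? "<unknown or bootstrap>" : location)
                + " by " + (loader == null ? "<bootstrap loader>" : loader);
    }

    public static void main(String[] args) throws Exception {
        // Defaults to the Tika facade class; this throws ClassNotFoundException
        // if no Tika jar is on the classpath.
        String name = args.length > 0 ? args[0] : "org.apache.tika.Tika";
        System.out.println(describe(Class.forName(name)));
    }
}
```

Logging ClassOrigin.describe(org.apache.tika.Tika.class) from the mapper, just 
before the parseToString call, should show whether the class comes from the 
deployed 1.17 jar or from a jar shipped with the distribution.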

> Tika call hangs when processes a pdf on Cloudera Hadoop
> -------------------------------------------------------
>
>                 Key: TIKA-2643
>                 URL: https://issues.apache.org/jira/browse/TIKA-2643
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17
>         Environment: Cloudera Hadoop 5.8
>            Reporter: feng ye
>            Priority: Blocker
>         Attachments: hang-stdout.txt, hang.zip, hs_err_pid32104.log, 
> testJournalParser.pdf
>
>
> Tika.parseToString(InputStream) hangs when called within a MapReduce job to 
> process a PDF file on Cloudera Hadoop 5.8 (also observed on 5.4). It can 
> process some other PDF files on the same cluster. I am attaching the file, 
> the syslog, and the stdout logs. Interestingly, the same file is processed 
> fine on a Hortonworks cluster. 
> This issue blocks us from making our Tika-based feature available on 
> Cloudera clusters, a major flavor of Hadoop, so your timely attention would 
> be very much appreciated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
