[ 
https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16477513#comment-16477513
 ] 

Ken Krugler commented on TIKA-2643:
-----------------------------------

If I had to guess, your Cloudera installation has different jars (or different 
versions of jars) on the classpath, and that's what is triggering the hang. 
When we use MR jobs to parse files, we always have to isolate the parse to 
avoid hanging. Since hangs aren't common, we do it via threading and accept that 
we'll wind up with some number of zombie threads, which is why we disable JVM 
reuse for these types of jobs. See 
[SimpleParser|https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java]
 and its use of 
[TikaCallable|https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java]
 in the [bixo|https://github.com/bixo/bixo] project.
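
To make the threading approach concrete, here is a minimal sketch of the general pattern: run the parse on its own thread and stop waiting after a timeout. This is not bixo's actual code. The class name IsolatedParse, the method parseWithTimeout, and the choice to return null on timeout are all hypothetical, and the parse itself is passed in as a plain Callable so the sketch has no Tika dependency.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration; not part of Tika or bixo.
final class IsolatedParse {

    private IsolatedParse() {}

    /**
     * Runs parseCall on a separate daemon thread and waits at most
     * timeoutMillis for it to finish. Returns the parsed text, or null
     * if the call did not complete in time (an assumption of this sketch;
     * throwing an exception would work equally well).
     */
    static String parseWithTimeout(Callable<String> parseCall, long timeoutMillis)
            throws ExecutionException, InterruptedException {
        FutureTask<String> task = new FutureTask<>(parseCall);
        Thread worker = new Thread(task, "isolated-parse");
        worker.setDaemon(true); // a stuck worker must not keep the JVM alive
        worker.start();
        try {
            return task.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort only: a parser stuck in a tight loop may ignore
            // the interrupt, leaving a zombie thread behind.
            task.cancel(true);
            return null;
        }
    }
}
```

In a job, the callable would wrap the real call, for example () -> new Tika().parseToString(stream), and a null result would be treated as a failed parse. Because a cancelled worker can keep running, the zombie threads accumulate for the life of the JVM, which is the reason for disabling JVM reuse.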

> Tika call hangs when processes a pdf on Cloudera Hadoop
> -------------------------------------------------------
>
>                 Key: TIKA-2643
>                 URL: https://issues.apache.org/jira/browse/TIKA-2643
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17
>         Environment: Cloudera Hadoop 5.8
>            Reporter: feng ye
>            Priority: Blocker
>         Attachments: hang-stdout.txt, hang.zip, testJournalParser.pdf
>
>
> Tika.parseToString(InputStream) hangs when called within a MapReduce job to 
> process a PDF file on Cloudera Hadoop 5.8 (also observed on 5.4). It can 
> process some other PDF files on the same cluster. I am attaching the file, 
> the syslog, and the stdout logs. Interestingly, the same file is processed 
> without problems on a Hortonworks cluster. 
> This issue is a blocker for making our Tika-based feature available on 
> Cloudera clusters, a major flavor of Hadoop, so your timely attention would be 
> very much appreciated.


