[ 
https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16477486#comment-16477486
 ] 

Tim Allison commented on TIKA-2643:
-----------------------------------

bq. The tricky part is I cannot attach a debugger against this call within 
MapReduce job over the cluster. 

Ugh.  Right. Of course.  Anything more you can do with logging?  I didn't read 
through your logs well enough, but can you confirm that the hang is happening 
during parseToString() and not immediately after it?
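
One way to answer that without a debugger is to log immediately before and after the call and, if the call has not returned within a deadline, log the stack trace of the thread that is running it. A minimal sketch, assuming the parse is wrapped in a `Callable` (for example `() -> tika.parseToString(stream)`); `HangProbe`, `callWithProbe`, and the timeout values are hypothetical names chosen for illustration and are not part of Tika:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical helper for illustration; not part of Tika.
final class HangProbe {

    /**
     * Runs the task on a worker thread, logging before and after it. If the
     * task has not finished within the timeout, logs the worker's current
     * stack trace, interrupts it, and rethrows the TimeoutException.
     */
    static <T> T callWithProbe(String label, Callable<T> task, long timeout, TimeUnit unit)
            throws Exception {
        AtomicReference<Thread> worker = new AtomicReference<>();
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "probe-" + label);
            t.setDaemon(true); // a stuck task must not keep the JVM alive
            worker.set(t);
            return t;
        });
        try {
            Future<T> future = pool.submit(() -> {
                System.err.println("[probe] entering " + label);
                T result = task.call();
                System.err.println("[probe] returned from " + label);
                return result;
            });
            try {
                return future.get(timeout, unit);
            } catch (TimeoutException e) {
                System.err.println("[probe] " + label + " still running after "
                        + timeout + " " + unit);
                Thread t = worker.get();
                if (t != null) {
                    // Shows where the call is stuck at this moment.
                    for (StackTraceElement frame : t.getStackTrace()) {
                        System.err.println("[probe]   at " + frame);
                    }
                }
                future.cancel(true); // best effort: interrupts the worker
                throw e;
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        String text = callWithProbe("fast-task", () -> "ok", 1, TimeUnit.SECONDS);
        System.out.println(text);
    }
}
```

If "entering" is logged but "returned" never is, the hang is inside the call, and the logged frames show where. If both lines appear, the hang is somewhere after it. Note that interrupting the worker is only a best effort: code that ignores interrupts will keep running.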

Without understanding your full framework, I can't say with any accuracy what 
might be causing this. :)

Some things that have caused permanent hangs for me in the past:
1) not draining stderr/stdout from a child process
2) infinite loops in parsers 
3) blocking IO that, well, blocks
4) calling the blocking take() instead of poll() on an 
ExecutorCompletionService
5) well, more generally, calling any of the blocking methods on theoretically 
concurrent/non-blocking objects, ArrayBlockingQueue, etc. instead of calling 
the non-blocking alternatives
6) Not quite a permanent hang, but crazy churn caused by multithreaded garbage 
collection
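
Items 4 and 5 can be illustrated with `ExecutorCompletionService`: `take()` waits indefinitely for the next completed task, while `poll(timeout, unit)` returns `null` when nothing completes in time, so the caller can log, cancel, or give up. A minimal sketch; `BoundedCollector`, `collect`, and the timeout values are hypothetical and chosen only for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration.
final class BoundedCollector {

    /**
     * Submits all tasks and collects results in completion order. Stops
     * waiting as soon as no task completes within the given timeout, so a
     * single stuck task cannot block the caller forever (as take() would).
     */
    static <T> List<T> collect(List<Callable<T>> tasks, long timeout, TimeUnit unit)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, tasks.size()), r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // stuck workers must not keep the JVM alive
            return t;
        });
        ExecutorCompletionService<T> ecs = new ExecutorCompletionService<>(pool);
        List<T> results = new ArrayList<>();
        try {
            for (Callable<T> task : tasks) {
                ecs.submit(task);
            }
            for (int i = 0; i < tasks.size(); i++) {
                Future<T> done = ecs.poll(timeout, unit); // null on timeout
                if (done == null) {
                    System.err.println("no task finished within " + timeout + " " + unit
                            + "; giving up");
                    break;
                }
                try {
                    results.add(done.get()); // already complete, returns immediately
                } catch (ExecutionException e) {
                    System.err.println("task failed: " + e.getCause());
                }
            }
        } finally {
            pool.shutdownNow(); // interrupts anything still running
        }
        return results;
    }

    public static void main(String[] args) throws InterruptedException {
        List<Callable<String>> tasks = new ArrayList<>();
        tasks.add(() -> "fast");
        tasks.add(() -> { Thread.sleep(60_000); return "stuck"; });
        System.out.println(collect(tasks, 500, TimeUnit.MILLISECONDS));
    }
}
```

In this sketch the timeout applies to each wait rather than to the whole batch. The slow task is abandoned after 500 ms and only the fast result is returned.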

I don't think this is the fault of the parser (2 above).  We can see from the 
logs that the parser is making at least some progress into the file.

Do any of the above look like candidates for you?

> Tika call hangs when processes a pdf on Cloudera Hadoop
> -------------------------------------------------------
>
>                 Key: TIKA-2643
>                 URL: https://issues.apache.org/jira/browse/TIKA-2643
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17
>         Environment: Cloudera Hadoop 5.8
>            Reporter: feng ye
>            Priority: Blocker
>         Attachments: hang-stdout.txt, hang.zip, testJournalParser.pdf
>
>
> Tika.parseToString(InputStream) hangs when called within a MapReduce job to 
> process a pdf file from Cloudera Hadoop 5.8 (observed on 5.4 too). It can 
> process some other pdf files on the same cluster. I am attaching the file and 
> the syslog as well as stdout logs. Interestingly, the same file can be 
> processed fine on a Hortonworks cluster. 
> This issue is a blocker for making our Tika-based feature available on 
> Cloudera clusters, a major flavor of Hadoop, so your timely attention would be 
> very much appreciated.



