[ 
https://issues.apache.org/jira/browse/NUTCH-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200034#comment-15200034
 ] 

eldk edited comment on NUTCH-2138 at 3/17/16 6:40 PM:
------------------------------------------------------

2016-03-17 18:44:29,656 INFO  parse.ParserFactory - The parsing plugins: 
[org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes 
system property, and all claim to support the content type application/pdf, but 
they are not mapped to it  in the parse-plugins.xml file

DEBUG tika.TikaParser - Using Tika parser org.apache.tika.parser.pdf.PDFParser 
for mime-type application/pdf


was (Author: eldk):
2016-03-17 18:44:29,656 INFO  parse.ParserFactory - The parsing plugins: 
[org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes 
system property, and all claim to support the content type application/pdf, but 
they are not mapped to it  in the parse-plugins.xml file

> Tika cannot OCR embedded images from PDF
> ----------------------------------------
>
>                 Key: NUTCH-2138
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2138
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.10
>         Environment: Nutch v1.10
> openjdk version "1.8.0_60-internal"
> Debian 7.8
> Tika 1.8 or Tika 1.10
>            Reporter: jean blue
>
> Tika 1.10 can OCR embedded images when PDFParser.properties is modified 
> accordingly in tika-app-1.10.jar, but parse-tika does not when the same 
> modifications are made in runtime/local/plugins/parse-tika/tika-parsers-1.10.jar
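
One way to confirm that the edited PDFParser.properties actually ended up inside the jar that parse-tika loads is to read the entry straight out of the archive. The following is a minimal sketch, not part of the original report. It assumes the file lives at org/apache/tika/parser/pdf/PDFParser.properties inside the jar and that the relevant key is named extractInlineImages; both are assumptions to check against your own copy. The helper name read_properties is hypothetical.

```python
import re
import sys
import zipfile

# Assumed location of the properties file inside the Tika parsers jar.
PROPS_PATH = "org/apache/tika/parser/pdf/PDFParser.properties"


def read_properties(jar_path, entry=PROPS_PATH):
    """Return the key/value pairs of a .properties entry stored in a jar."""
    with zipfile.ZipFile(jar_path) as jar:
        # Java .properties files are conventionally ISO-8859-1 encoded.
        text = jar.read(entry).decode("iso-8859-1")

    props = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments ('#' or '!').
        if not line or line[0] in "#!":
            continue
        # Key and value may be separated by '=', ':' or plain whitespace.
        parts = re.split(r"\s*[=:]\s*|\s+", line, maxsplit=1)
        props[parts[0]] = parts[1] if len(parts) > 1 else ""
    return props


if __name__ == "__main__":
    # Example: python check_props.py path/to/tika-parsers-1.10.jar
    settings = read_properties(sys.argv[1])
    # 'extractInlineImages' is the assumed name of the OCR-related switch.
    print("extractInlineImages =", settings.get("extractInlineImages"))
```

Running it against both tika-app-1.10.jar and runtime/local/plugins/parse-tika/tika-parsers-1.10.jar shows whether the two jars carry the same setting. It does not show whether Nutch picks that jar up at runtime.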



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
