[ https://issues.apache.org/jira/browse/NUTCH-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200034#comment-15200034 ]
eldk edited comment on NUTCH-2138 at 3/18/16 4:35 PM:
------------------------------------------------------
2016-03-17 18:44:29,656 INFO parse.ParserFactory - The parsing plugins:
[org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes
system property, and all claim to support the content type application/pdf, but
they are not mapped to it in the parse-plugins.xml file
DEBUG tika.TikaParser - Using Tika parser org.apache.tika.parser.pdf.PDFParser
for mime-type application/pdf
https://issues.apache.org/jira/browse/TIKA-93
https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/Tika.java
> Tika cannot OCR embedded images from PDF
> ----------------------------------------
>
> Key: NUTCH-2138
> URL: https://issues.apache.org/jira/browse/NUTCH-2138
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.10
> Environment: Nutch v1.10
> openjdk version "1.8.0_60-internal"
> Debian 7.8
> Tika 1.8 or Tika 1.10
> Reporter: jean blue
>
> Tika 1.10 can OCR embedded images when PDFParser.properties is modified
> accordingly in tika-app-1.10.jar, but parse-tika does not when the same
> modifications are made in runtime/local/plugins/parse-tika/tika-parsers-1.10.jar
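
One way to confirm that the edited PDFParser.properties actually ended up inside each jar is to read it back out of the archive and compare the two. The sketch below is only a diagnostic aid, not part of Nutch or Tika. The entry path org/apache/tika/parser/pdf/PDFParser.properties and the extractInlineImages key are assumptions based on the Tika 1.x layout, since the report does not name the property that was changed, and read_pdf_parser_properties is a hypothetical helper.

```python
import io
import re
import zipfile

# Assumed location of the properties file inside a Tika 1.x jar.
PROPS_ENTRY = "org/apache/tika/parser/pdf/PDFParser.properties"


def read_pdf_parser_properties(jar, entry=PROPS_ENTRY):
    """Return the key/value pairs stored in PDFParser.properties inside a jar.

    `jar` may be a filesystem path or a binary file-like object.
    Raises KeyError if the jar has no such entry.
    """
    with zipfile.ZipFile(jar) as zf:
        text = zf.read(entry).decode("utf-8")

    props = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", "!")):
            continue
        # Java .properties files accept '=', ':' or whitespace as separator.
        parts = re.split(r"\s*[=:]\s*|\s+", line, maxsplit=1)
        key = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        props[key] = value
    return props
```

Calling read_pdf_parser_properties("tika-app-1.10.jar") and read_pdf_parser_properties("runtime/local/plugins/parse-tika/tika-parsers-1.10.jar") and comparing the two dictionaries would show whether both jars carry the same settings. Matching values would suggest the difference lies in how parse-tika configures the parser rather than in the jar contents.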