[ https://issues.apache.org/jira/browse/NUTCH-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200034#comment-15200034 ]

eldk edited comment on NUTCH-2138 at 3/20/16 3:19 PM:
------------------------------------------------------

2016-03-17 18:44:29,656 INFO  parse.ParserFactory - The parsing plugins: 
[org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes 
system property, and all claim to support the content type application/pdf, but 
they are not mapped to it  in the parse-plugins.xml file

DEBUG tika.TikaParser - Using Tika parser org.apache.tika.parser.pdf.PDFParser 
for mime-type application/pdf

https://issues.apache.org/jira/browse/TIKA-93
https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/Tika.java

https://github.com/apache/tika/blob/master/tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
https://github.com/apache/tika/blob/master/tika-parsers/src/main/resources/org/apache/tika/parser/ocr/TesseractOCRConfig.properties

Same with Nutch 1.12, lib/tika-core-1.12.jar, 
plugins/parse-tika/tika-parsers-1.12.jar


> Tika cannot OCR embedded images from PDF
> ----------------------------------------
>
>                 Key: NUTCH-2138
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2138
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.10
>         Environment: Nutch v1.10
> openjdk version "1.8.0_60-internal"
> Debian 7.8
> Tika 1.8 or Tika 1.10
>            Reporter: jean blue
>
> Tika 1.10 is able to OCR embedded images if PDFParser.properties is modified 
> accordingly in tika-app-1.10.jar, but parse-tika does not when the same 
> modifications are made in runtime/local/plugins/parse-tika/tika-parsers-1.10.jar
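One way to rule out a packaging mistake is to read PDFParser.properties straight out of each jar and compare the values. The following is only a minimal sketch using the JDK alone: the class name PdfParserPropsCheck and the helper readSetting are hypothetical, and the key extractInlineImages is assumed to be the setting that was changed (the issue does not name it). It does not show which copy of the file parse-tika actually loads at runtime.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper, not part of Nutch or Tika.
final class PdfParserPropsCheck {
    // Resource path of the PDF parser defaults, taken from the Tika source tree.
    static final String ENTRY = "org/apache/tika/parser/pdf/PDFParser.properties";

    /** Returns the value of key in the jar's PDFParser.properties, or null if absent. */
    static String readSetting(String jarPath, String key) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            JarEntry entry = jar.getJarEntry(ENTRY);
            if (entry == null) {
                return null; // this jar does not ship the properties file
            }
            try (InputStream in = jar.getInputStream(entry)) {
                Properties props = new Properties();
                // Properties.load accepts "key value" lines separated by whitespace.
                props.load(in);
                return props.getProperty(key);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // "extractInlineImages" is an assumption about which key was edited.
        for (String jarPath : args) {
            System.out.println(jarPath + ": extractInlineImages="
                    + readSetting(jarPath, "extractInlineImages"));
        }
    }
}
```

Run it with the two jar paths from the description as arguments, for example tika-app-1.10.jar and runtime/local/plugins/parse-tika/tika-parsers-1.10.jar. If both report the modified value, the jar contents are not the cause and the difference lies in how parse-tika configures the parser.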



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
