[
https://issues.apache.org/jira/browse/NUTCH-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195439#comment-15195439
]
eldk edited comment on NUTCH-2138 at 3/15/16 3:18 PM:
------------------------------------------------------
Hello,
OCR for images embedded in PDFs is still not working with Nutch 1.11
(lib/tika-core-1.11.jar, plugins/parse-tika/tika-parsers-1.11.jar); the
ParseText section of the parsechecker output below is empty.
tesseract -v
tesseract 3.03
leptonica-1.70
libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 :
webp 0.4.0
grep tika nutch-default.xml
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
bin/nutch parsechecker -dumpText http://domain.tld/file.pdf
fetching: http://domain.tld/file.pdf
robots.txt whitelist not configured.
parsing: http://domain.tld/file.pdf
contentType: application/pdf
signature: af00322e75c5eb43085df668f2faca2f
---------
Url
---------------
http://domain.tld/file.pdf
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title:
Outlinks: 0
Content Metadata: nutch.fetch.time=1458053976974 Age=0 Content-Language=fr-FR
Served-by=domain.tld Content-Length=5052242 Content-Transfer-Encoding=binary
Expires=Tue, 15 Mar 2016 15:09:37 GMT Last-Modified=Fri, 12 Jun 2015 14:58:13
GMT Set-Cookie=eZSESSID=6ns8c06tnu40kd3ohfpl6vnrj5; path=/ Connection=close
X-Cache=Miss from Varnish Server=nginx X-Powered-By=eZ Publish Cache-Control=
Pragma= X-Varnish=1186703160 Date=Tue, 15 Mar 2016 14:59:37 GMT
Content-Disposition=inline; filename="file.pdf" nutch.crawl.score=0.0 Via=1.1
varnish Accept-Ranges=bytes Content-Type=application/pdf
Parse Metadata: access_permission:extract_for_accessibility=true
meta:save-date=2015-06-12T14:47:32Z dcterms:created=2015-06-12T14:47:32Z
date=2015-06-12T14:47:32Z access_permission:can_modify=true
access_permission:modify_annotations=true Creation-Date=2015-06-12T14:47:32Z
created=Fri Jun 12 16:47:32 CEST 2015 access_permission:fill_in_form=true
access_permission:can_print=true dc:format=application/pdf; version=1.4
xmp:CreatorTool=RICOH MP 3353 Last-Save-Date=2015-06-12T14:47:32Z
access_permission:assemble_document=true
meta:creation-date=2015-06-12T14:47:32Z dcterms:modified=2015-06-12T14:47:32Z
Last-Modified=2015-06-12T14:47:32Z pdf:PDFVersion=1.4
modified=2015-06-12T14:47:32Z xmpTPg:NPages=45
access_permission:can_print_degraded=true pdf:encrypted=false
access_permission:extract_content=true producer=RICOH MP 3353
Content-Type=application/pdf
---------
ParseText
---------
thanks,
Eric
https://tika.apache.org/1.11/gettingstarted.html
> Tika cannot OCR embedded images from PDF
> ----------------------------------------
>
> Key: NUTCH-2138
> URL: https://issues.apache.org/jira/browse/NUTCH-2138
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.10
> Environment: Nutch v1.10
> openjdk version "1.8.0_60-internal"
> Debian 7.8
> Tika 1.8 or Tika 1.10
> Reporter: jean blue
>
> Tika 1.10 is able to OCR embedded images if PDFParser.properties is modified
> accordingly in tika-app-1.10.jar but parse-tika doesn't if same modifications
> are made in runtime/local/plugins/parse-tika/tika-parsers-1.10.jar
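One way to rule out a packaging mistake is to read the properties file straight out of the jar that parse-tika loads and confirm the edit is really there. The following is a minimal sketch, not part of Nutch or Tika: the resource path `org/apache/tika/parser/pdf/PDFParser.properties` and the key `extractInlineImages` are assumptions about how the tika-parsers jar is laid out and should be checked against the actual jar, and `read_properties` is a hypothetical helper that handles only simple `key value`, `key=value` and `key: value` lines (no escapes or line continuations).

```python
import zipfile

# Assumed location of the PDF parser defaults inside the tika-parsers jar.
PDF_PROPS = "org/apache/tika/parser/pdf/PDFParser.properties"


def read_properties(jar_path, entry=PDF_PROPS):
    """Return the key/value pairs of a .properties entry stored in a jar."""
    with zipfile.ZipFile(jar_path) as jar:
        text = jar.read(entry).decode("iso-8859-1")

    props = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", "!")):
            continue
        # The key ends at the first '=', ':' or whitespace character.
        for i, ch in enumerate(line):
            if ch in "=: \t":
                key = line[:i]
                value = line[i + 1:].lstrip("=: \t")
                break
        else:
            key, value = line, ""
        props[key] = value.strip()
    return props


# Example call (path taken from the issue description; adjust to your install):
# props = read_properties("runtime/local/plugins/parse-tika/tika-parsers-1.10.jar")
# print("extractInlineImages =", props.get("extractInlineImages"))
```

If the value printed for the jar under plugins/parse-tika differs from the one in the modified tika-app jar, the edit did not reach the plugin. If both match, the cause is more likely in how the plugin configures the parser than in the jar contents.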