[ https://issues.apache.org/jira/browse/TIKA-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15644077#comment-15644077 ]
Tim Allison commented on TIKA-2167:
-----------------------------------
Thank you for opening this. Can you give more information on what is failing?
What is the stack trace? How are you running Tika (tika-app, tika-server, or
the API)?
When I run this in trunk via the API:
{noformat}
@Test
public void testTiff() throws Exception {
    XMLResult r = getXML("simple.tiff");
    System.out.println(r.xml);
}
{noformat}
I get this:
{noformat}
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.ocr.TesseractOCRParser" />
<meta name="Content-Type" content="image/tiff" />
<title></title>
</head>
<body><div class="ocr">HEAVY
METAL
</div>
<html>
<meta name="Strip Byte Counts" content="23139 2217 bytes" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.ocr.TesseractOCRParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.image.TiffParser" />
<meta name="Compression" content="LZW" />
<meta name="File Modified Date" content="Mon Nov 07 12:38:12 +00:00 2016" />
<meta name="Predictor" content="2" />
<meta name="tiff:SamplesPerPixel" content="3" />
<meta name="Unknown tag (0x0153)" content="1 1 1" />
<meta name="tiff:ImageLength" content="165" />
<meta name="Samples Per Pixel" content="3 samples/pixel" />
<meta name="Inter Color Profile" content="[3144 values]" />
<meta name="Image Height" content="165 pixels" />
<meta name="Strip Offsets" content="8 23147" />
<meta name="Orientation" content="Top, left side (Horizontal / normal)" />
<meta name="tiff:Orientation" content="1" />
<meta name="Planar Configuration" content="Chunky (contiguous for each subsampling pixel)" />
<meta name="Image Width" content="306 pixels" />
<meta name="Photometric Interpretation" content="RGB" />
<meta name="File Size" content="28710 bytes" />
<meta name="Rows Per Strip" content="142 rows/strip" />
<meta name="File Name" content="apache-tika-434186512334376884.tmp" />
<meta name="tiff:BitsPerSample" content="8" />
<meta name="tiff:ImageWidth" content="306" />
<meta name="Content-Type" content="image/tiff" />
<meta name="Bits Per Sample" content="8 8 8 bits/component/pixel" />
<title></title>
<body /></html></body></html>
{noformat}
It looks like we need to fix the XHTML (I just opened TIKA-2169 for that), but
I'm not getting a failure: OCR returns the expected text.
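One problem is visible in the output above: a second <html> element is opened inside <body>. As a rough way to spot this in other output, the sketch below counts <html> elements with the JDK's built-in DOM parser; any count other than one suggests the same nesting. The class name NestedHtmlCheck, the helper countHtmlElements, and the trimmed sample string are illustrative assumptions and not part of Tika, and this may not be the exact defect tracked in TIKA-2169.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of Tika.
class NestedHtmlCheck {

    // Parses a well-formed XHTML string and returns how many <html>
    // elements it contains, in any namespace. A clean document has one.
    static int countHtmlElements(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        return doc.getElementsByTagNameNS("*", "html").getLength();
    }

    public static void main(String[] args) throws Exception {
        // Trimmed-down version of the output shown above (assumed sample).
        String sample =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
            + "<head><title></title></head>"
            + "<body><div class=\"ocr\">HEAVY METAL</div>"
            + "<html><title></title><body /></html>"
            + "</body></html>";
        int count = countHtmlElements(sample);
        System.out.println("html elements: " + count);
        if (count != 1) {
            System.out.println("nested <html> detected");
        }
    }
}
```

In a unit test, the same helper could be called on r.xml from the test above. The check only works while the output stays well-formed XML, which is the case here.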
> Image processing causes OCR to fail
> -----------------------------------
>
> Key: TIKA-2167
> URL: https://issues.apache.org/jira/browse/TIKA-2167
> Project: Tika
> Issue Type: Bug
> Components: ocr
> Affects Versions: 1.14
> Environment: Mac OS X 10.11.6; Java 1.8.0_45; tesseract 3.04.01;
> ImageMagick 6.9.6-2
> Reporter: Matthew Caruana Galizia
> Priority: Critical
> Labels: convert, image, ocr, tiff
> Attachments: simple.tiff
>
>
> Image processing before OCR is enabled by default in the OCR configuration
> properties file. Unless this is disabled, running Tika on a simple TIFF image
> (attached) with two clear words fails. When image processing is disabled, it
> succeeds.