Adrian Bird created TIKA-4747:
---------------------------------
Summary: tika-4.0.0-alpha1 - PDF and Tesseract Parser Comments
Key: TIKA-4747
URL: https://issues.apache.org/jira/browse/TIKA-4747
Project: Tika
Issue Type: Bug
Affects Versions: 4.0.0
Environment: Windows 11
Reporter: Adrian Bird
I've tried the PDF and Tesseract parsers, independently and together, and here
are some comments.
1. pdf-parser Full Configuration example
The [Full Configuration
Example|https://tika.apache.org/docs/4.0.0-SNAPSHOT/configuration/parsers/pdf-parser.html#_full_configuration]
has an unknown property "maxPages".
{code:java}
Caused by: java.lang.RuntimeException: Failed to parse PDFParserConfig
configuration: Unrecognized field "maxPages" (class
org.apache.tika.parser.pdf.PDFParserConfig), not marked as ignorable (39 known
properties: "ocrStrategyAuto", "imageGraphicsEngineFactory", "detectAngles",
"ocrMaxPagesToOcr", "ignoreContentStreamSpaceGlyphs", "accessCheckMode",
"extractBookmarksText", "spacingTolerance", "extractUniqueInlineImagesOnly",
"suppressDuplicateOverlappingText", "extractInlineImages", "enableAutoSpace",
"imageGraphicsEngineFactoryClass", "extractInlineImageMetadataOnly",
"extractAnnotationText", "sortByPosition", "ocrDPI", "setKCMS",
"ocrRenderingStrategy", "ocrMaxImagePixels", "parseIncrementalUpdates",
"extractMarkedContent", "maxMainMemoryBytes", "imageStrategy",
"throwOnEncryptedPayload", "ocrStrategy", "ocrImageFormat",
"extractAcroFormContent", "ocrImageType", "extractFontNames",
"averageCharTolerance", "dropThreshold", "extractIncrementalUpdateInfo",
"maxIncrementalUpdates", "ifXFAExtractOnlyXFA", "ocr", "ocrImageQuality",
"extractActions", "catchIntermediateIOExceptions")
at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION`
disabled); line: 1, column: 626] (through reference chain:
org.apache.tika.parser.pdf.PDFParserConfig["maxPages"])
{code}
2. Refers to 1 above
In the list of 39 known properties the following do not appear in the full
configuration example:
{noformat}
"imageGraphicsEngineFactory",
"imageGraphicsEngineFactoryClass",
"ocrMaxImagePixels",
"ocrMaxPagesToOcr",
{noformat}
3. Refers to 1 above
Is there a description of the properties somewhere?
4. Tesseract OCR Full Configuration example
The [Full Configuration
example|https://tika.apache.org/docs/4.0.0-SNAPSHOT/configuration/parsers/tesseract-ocr-parser.html#_full_configuration]
didn't work for me.
I saw the following message:
{noformat}
DEBUG [main] 10:48:58,549
org.apache.tika.config.loader.AbstractSpiComponentLoader Skipping SPI parsers -
'default-parser' not in config
{noformat}
and decided to add the following:
{code:java}
{
"default-parser": {}
}
{code}
That fixed the problem.
The pdf-parser worked without this 'default-parser' entry.
5. Refers to 4 above
Is there a description of the properties somewhere?
Also, is there any documentation that says ImageMagick is an optional component?
6. Disabling Tesseract
A message is output that refers to the old XML way of disabling Tesseract:
{noformat}
INFO [main] 09:32:13,811 org.apache.tika.parser.ocr.TesseractOCRParser
Tesseract is installed and is being invoked. This can add greatly to processing
time. If you do not want tesseract to be applied to your files see:
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
{noformat}
7. ImageMagick and Tesseract locations in Windows
If I use a Windows path with backslashes for the ImageMagick or Tesseract
locations, I get an exception (using / on Windows works OK):
{noformat}
"imageMagickPath": "C:\ImageMagick",
"tessdataPath": "C:\Tesseract-OCR\tessdata",
"tesseractPath": "C:\Tesseract-OCR",
{noformat}
gives the following for an invalid Tesseract location:
{noformat}
Exception in thread "main" java.io.IOException:
com.fasterxml.jackson.core.JsonParseException: Unrecognized character escape
'T' (code 84)
at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION`
disabled); line: 30, column: 29]
at org.apache.tika.async.cli.PluginsWriter.write(PluginsWriter.java:167)
at org.apache.tika.async.cli.TikaAsyncCLI.processCommandLine(TikaAsyncCLI.java:117)
at org.apache.tika.async.cli.TikaAsyncCLI.main(TikaAsyncCLI.java:93)
at org.apache.tika.cli.TikaCLI.async(TikaCLI.java:301)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:261)
{noformat}
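The parse error above is consistent with JSON's escaping rules: a backslash inside a JSON string starts an escape sequence, and \T is not a valid one. A Windows path therefore needs either doubled backslashes or forward slashes. A minimal Java sketch of both conversions follows; the class and method names are hypothetical and are not part of Tika.

```java
// Hypothetical helper, not part of Tika: makes a Windows path safe to paste
// into a JSON string value.
final class WindowsPathJson {

    // Doubles each backslash so a JSON parser reads it as a literal backslash.
    static String escapeBackslashes(String path) {
        return path.replace("\\", "\\\\");
    }

    // Alternative: swap backslashes for forward slashes, which the report
    // above says works on Windows.
    static String toForwardSlashes(String path) {
        return path.replace('\\', '/');
    }

    public static void main(String[] args) {
        // Java source escape for the path C:\Tesseract-OCR
        String tesseract = "C:\\Tesseract-OCR";
        System.out.println("\"tesseractPath\": \"" + escapeBackslashes(tesseract) + "\",");
        System.out.println("\"tesseractPath\": \"" + toForwardSlashes(tesseract) + "\",");
    }
}
```

Either printed line should be acceptable to a JSON parser, whereas the single-backslash form from the report is not.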
8. ImageMagick Failures
When Tika ran ImageMagick, it always returned an exit code of 1.
With ImageMagick on the PATH and no "imageMagickPath" key set, I got these messages:
{noformat}
Use "magick" instead of the deprecated command "magick convert".
WARN [main] 09:39:35,333 org.apache.tika.parser.ocr.ImagePreprocessor
ImageMagick failed (commandline: [magick, convert, -density, 300, -depth, 4,
-colorspace, gray, -filter, triangle, -resize, 200%,
C:\Users\xxx\AppData\Local\Temp\apache-tika-8707805858872770017.tmp,
C:\Users\xxx\AppData\Local\Temp\apache-tika-8707805858872770017.tmp])
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit
value: 1)
{noformat}
With ImageMagick not on the PATH and the "imageMagickPath" key set, I got these messages:
{noformat}
magick: no decode delegate for this image format `' @
error/constitute.c/ReadImage/746.
WARN [main] 10:09:59,780 org.apache.tika.parser.ocr.ImagePreprocessor
ImageMagick failed (commandline: [C:\ImageMagick\magick, convert, -density,
300, -depth, 4, -colorspace, gray, -filter, triangle, -resize, 200%,
C:\Users\xxx\AppData\Local\Temp\apache-tika-4722539874421120895.tmp,
C:\Users\xxx\AppData\Local\Temp\apache-tika-4722539874421120895.tmp])
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit
value: 1)
{noformat}
Is the fact that the same filename is used twice at the end a cause for concern?
I know very little about ImageMagick but could reproduce the error by running
this outside of Tika:
{noformat}
%IMAGEMAGICK_HOME%\magick convert -density 300 -depth 4 -colorspace gray
-filter triangle -resize 200% image.jpg image.png
{noformat}
I get the error:
{noformat}
magick: no decode delegate for this image format `' @
error/constitute.c/ReadImage/746.
{noformat}
If I change it by removing 'convert' and putting the source image first:
{noformat}
%IMAGEMAGICK_HOME%\magick image.jpg -density 300 -depth 4 -colorspace gray
-filter triangle -resize 200% image.png
{noformat}
it runs successfully.
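
The working invocation above can be expressed as an argument list with the input file first, the options next, and a distinct output file last, with no 'convert' subcommand. The sketch below only builds that list and does not run ImageMagick. The class and method names are hypothetical, and this is not a claim about how Tika's ImagePreprocessor is, or should be, implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Tika's ImagePreprocessor: assembles the argument
// order that ran successfully in the manual test above.
final class MagickCommandSketch {

    static List<String> buildCommand(String magick, String input, String output) {
        List<String> cmd = new ArrayList<>();
        cmd.add(magick);   // executable only, no "convert" subcommand
        cmd.add(input);    // source image comes before the options
        // Same options as in the failing command line from the log.
        cmd.addAll(List.of(
                "-density", "300",
                "-depth", "4",
                "-colorspace", "gray",
                "-filter", "triangle",
                "-resize", "200%"));
        cmd.add(output);   // destination image comes last
        return cmd;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ",
                buildCommand("magick", "image.jpg", "image.png")));
    }
}
```

The resulting list could be handed to a process launcher such as java.lang.ProcessBuilder; whether the input and output may safely be the same temporary file is a separate question that this sketch does not answer.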