Adrian Bird created TIKA-4747:
---------------------------------

             Summary: tika-4.0.0-alpha1 - PDF and Tesseract Parser Comments
                 Key: TIKA-4747
                 URL: https://issues.apache.org/jira/browse/TIKA-4747
             Project: Tika
          Issue Type: Bug
    Affects Versions: 4.0.0
         Environment: Windows 11
            Reporter: Adrian Bird


I've tried the PDF and Tesseract parsers, both independently and together, and 
here are some comments.
1. pdf-parser Full Configuration example

The [Full Configuration Example|https://tika.apache.org/docs/4.0.0-SNAPSHOT/configuration/parsers/pdf-parser.html#_full_configuration] has an unknown property "maxPages".
{code:java}
Caused by: java.lang.RuntimeException: Failed to parse PDFParserConfig 
configuration: Unrecognized field "maxPages" (class 
org.apache.tika.parser.pdf.PDFParserConfig), not marked as ignorable (39 known 
properties: "ocrStrategyAuto", "imageGraphicsEngineFactory", "detectAngles", 
"ocrMaxPagesToOcr", "ignoreContentStreamSpaceGlyphs", "accessCheckMode", 
"extractBookmarksText", "spacingTolerance", "extractUniqueInlineImagesOnly", 
"suppressDuplicateOverlappingText", "extractInlineImages", "enableAutoSpace", 
"imageGraphicsEngineFactoryClass", "extractInlineImageMetadataOnly", 
"extractAnnotationText", "sortByPosition", "ocrDPI", "setKCMS", 
"ocrRenderingStrategy", "ocrMaxImagePixels", "parseIncrementalUpdates", 
"extractMarkedContent", "maxMainMemoryBytes", "imageStrategy", 
"throwOnEncryptedPayload", "ocrStrategy", "ocrImageFormat", 
"extractAcroFormContent", "ocrImageType", "extractFontNames", 
"averageCharTolerance", "dropThreshold", "extractIncrementalUpdateInfo", 
"maxIncrementalUpdates", "ifXFAExtractOnlyXFA", "ocr", "ocrImageQuality", 
"extractActions", "catchIntermediateIOExceptions")
 at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION` 
disabled); line: 1, column: 626] (through reference chain: 
org.apache.tika.parser.pdf.PDFParserConfig["maxPages"])
{code}
2. Properties missing from the pdf-parser example (refers to 1 above)
In the list of 39 known properties, the following do not appear in the Full 
Configuration example:
{noformat}
"imageGraphicsEngineFactory",
"imageGraphicsEngineFactoryClass",
"ocrMaxImagePixels",
"ocrMaxPagesToOcr",
{noformat}
3. pdf-parser property descriptions (refers to 1 above)
Is there a description of the PDFParserConfig properties somewhere?

4. Tesseract OCR Full Configuration example
The [Full Configuration example|https://tika.apache.org/docs/4.0.0-SNAPSHOT/configuration/parsers/tesseract-ocr-parser.html#_full_configuration] didn't work for me.
I saw the following message:
{noformat}
DEBUG [main] 10:48:58,549 
org.apache.tika.config.loader.AbstractSpiComponentLoader Skipping SPI parsers - 
'default-parser' not in config
{noformat}
and decided to add the following:
{code:json}
    {
      "default-parser": {}
    }
{code}
That fixed the problem.

The pdf-parser worked without this 'default-parser' entry.

5. Tesseract OCR property descriptions (refers to 4 above)
Is there a description of the Tesseract OCR parser properties somewhere?

Also, is there any documentation that says ImageMagick is an optional component?

6. Disabling Tesseract
An INFO message is logged that points to the old XML-based way of disabling Tesseract:
{noformat}
INFO  [main] 09:32:13,811 org.apache.tika.parser.ocr.TesseractOCRParser 
Tesseract is installed and is being invoked. This can add greatly to processing 
time.  If you do not want tesseract to be applied to your files see: 
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
{noformat}
7. ImageMagick and Tesseract locations in Windows
If I use a Windows path with backslashes for the ImageMagick or Tesseract 
locations, I get an exception (using / instead of \ on Windows works OK):
{noformat}
        "imageMagickPath": "C:\ImageMagick",
        "tessdataPath": "C:\Tesseract-OCR\tessdata",
        "tesseractPath": "C:\Tesseract-OCR",
{noformat}
gives the following for an invalid Tesseract location:
{noformat}
Exception in thread "main" java.io.IOException: 
com.fasterxml.jackson.core.JsonParseException: Unrecognized character escape 
'T' (code 84)
 at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION` 
disabled); line: 30, column: 29]
        at org.apache.tika.async.cli.PluginsWriter.write(PluginsWriter.java:167)
        at 
org.apache.tika.async.cli.TikaAsyncCLI.processCommandLine(TikaAsyncCLI.java:117)
        at org.apache.tika.async.cli.TikaAsyncCLI.main(TikaAsyncCLI.java:93)
        at org.apache.tika.cli.TikaCLI.async(TikaCLI.java:301)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:261)
{noformat}
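The escape error above is consistent with the JSON grammar, in which a backslash inside a string must itself be escaped. As a minimal sketch of that rule (using Python's standard json module rather than Jackson or any Tika code; the helper name try_parse and the one-key config fragments are made up for illustration):

```python
import json


def try_parse(text):
    """Return the parsed tesseractPath, or None if the text is not valid JSON."""
    try:
        return json.loads(text)["tesseractPath"]
    except json.JSONDecodeError:
        # "\T" is not a valid JSON escape sequence, so parsing fails here.
        return None


# Raw strings keep the backslashes exactly as they would appear in the config file.
unescaped = r'{"tesseractPath": "C:\Tesseract-OCR"}'   # single backslash: invalid JSON
escaped = r'{"tesseractPath": "C:\\Tesseract-OCR"}'    # doubled backslash: valid JSON
forward = '{"tesseractPath": "C:/Tesseract-OCR"}'      # forward slash: valid JSON

for label, text in [("unescaped", unescaped), ("escaped", escaped), ("forward", forward)]:
    print(label, "->", try_parse(text))
```

Only the first form is rejected. Doubling the backslashes or using forward slashes both give a parseable value, which matches the observation that / works.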
8. ImageMagick Failures
When Tika ran ImageMagick, it always returned an exit code of 1.

With ImageMagick on the PATH and no "imageMagickPath" key set, I got these messages:
{noformat}
Use "magick" instead of the deprecated command "magick convert".
WARN  [main] 09:39:35,333 org.apache.tika.parser.ocr.ImagePreprocessor 
ImageMagick failed (commandline: [magick, convert, -density, 300, -depth, 4, 
-colorspace, gray, -filter, triangle, -resize, 200%, 
C:\Users\xxx\AppData\Local\Temp\apache-tika-8707805858872770017.tmp, 
C:\Users\xxx\AppData\Local\Temp\apache-tika-8707805858872770017.tmp])
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit 
value: 1)

{noformat}
With ImageMagick not on the PATH and the "imageMagickPath" key set, I got these messages:
{noformat}
magick: no decode delegate for this image format `' @ 
error/constitute.c/ReadImage/746.
WARN  [main] 10:09:59,780 org.apache.tika.parser.ocr.ImagePreprocessor 
ImageMagick failed (commandline: [C:\ImageMagick\magick, convert, -density, 
300, -depth, 4, -colorspace, gray, -filter, triangle, -resize, 200%, 
C:\Users\xxx\AppData\Local\Temp\apache-tika-4722539874421120895.tmp, 
C:\Users\xxx\AppData\Local\Temp\apache-tika-4722539874421120895.tmp])
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit 
value: 1)

{noformat}
The same temporary filename appears twice at the end of the command line, as both 
the input and the output. Is that a cause for concern?

I know very little about ImageMagick but could reproduce the error by running 
this outside of Tika:
{noformat}
%IMAGEMAGICK_HOME%\magick convert -density 300 -depth 4 -colorspace gray 
-filter triangle -resize 200% image.jpg image.png

{noformat}
I get the error:
{noformat}
magick: no decode delegate for this image format `' @ 
error/constitute.c/ReadImage/746.

{noformat}
If I change it by removing the 'convert' and putting the source image at the 
start:
{noformat}
%IMAGEMAGICK_HOME%\magick image.jpg -density 300 -depth 4 -colorspace gray 
-filter triangle -resize 200% image.png

{noformat}
it runs successfully.
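
To make the difference between the two invocations explicit, here is a minimal Python sketch that builds the argument list in the order that worked for me: no 'convert' subcommand and the source image first. This is not how Tika's ImagePreprocessor builds its command line. The function name build_magick_command and its parameters are hypothetical, and the option values are simply copied from the log output above.

```python
def build_magick_command(magick, source, target):
    """Build an ImageMagick argument list with the input image first and no 'convert'."""
    return [
        magick,            # path to the magick executable (assumed to be ImageMagick 7)
        source,            # input image comes directly after the executable
        "-density", "300",
        "-depth", "4",
        "-colorspace", "gray",
        "-filter", "triangle",
        "-resize", "200%",
        target,            # output image goes last
    ]


cmd = build_magick_command("magick", "image.jpg", "image.png")
print(" ".join(cmd))
```

The resulting list could be handed to a process launcher such as subprocess.run, but the sketch stops at building the arguments so that it does not depend on ImageMagick being installed.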



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
