Chris A. Mattmann created TIKA-1421:
---------------------------------------
Summary: Tika-Parsers tests fail on CentOS6 if tesseract isn't installed
Key: TIKA-1421
URL: https://issues.apache.org/jira/browse/TIKA-1421
Project: Tika
Issue Type: Bug
Components: parser
Environment: CentOS6 AWS VM for DARPA Memex
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.7
While testing TIKA-93 on CentOS6, I ran into test failures in tika-parsers on a
fresh install of Tika 1.7-trunk:
{noformat}
Running org.apache.tika.parser.chm.TestChmLzxcControlData
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec
Running org.apache.tika.parser.chm.TestChmBlockInfo
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
Running org.apache.tika.parser.chm.TestChmItsfHeader
Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec
Running org.apache.tika.parser.txt.TXTParserTest
Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.016 sec
Running org.apache.tika.parser.txt.CharsetDetectorTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
Running org.apache.tika.parser.image.xmp.JempboxExtractorTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec
Running org.apache.tika.parser.image.PSDParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
Running org.apache.tika.parser.image.ImageParserTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.034 sec
Running org.apache.tika.parser.image.ImageMetadataExtractorTest
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.241 sec
Running org.apache.tika.parser.image.MetadataFieldsTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running org.apache.tika.parser.image.TiffParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
Running org.apache.tika.parser.font.FontParsersTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.192 sec
Running org.apache.tika.parser.mp4.MP4ParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.07 sec
Running org.apache.tika.parser.mp3.Mp3ParserTest
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.046 sec
Running org.apache.tika.parser.mp3.MpegStreamTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
Running org.apache.tika.parser.dwg.DWGParserTest
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
Running org.apache.tika.parser.pkg.GzipParserTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.252 sec
Running org.apache.tika.parser.pkg.Seven7ParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.37 sec
Running org.apache.tika.parser.pkg.TarParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.118 sec
Running org.apache.tika.parser.pkg.Bzip2ParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.233 sec
Running org.apache.tika.parser.pkg.ArParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.017 sec
Running org.apache.tika.parser.pkg.ZipParserTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.302 sec
Running org.apache.tika.parser.video.FLVParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.026 sec
Running org.apache.tika.parser.solidworks.SolidworksParserTest
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.019 sec
Running org.apache.tika.parser.ibooks.iBooksParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.019 sec
Running org.apache.tika.parser.ParsingReaderTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.018 sec
Running org.apache.tika.parser.mail.RFC822ParserTest
Tests run: 8, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 0.31 sec <<<
FAILURE!
Running org.apache.tika.parser.mbox.MboxParserTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.026 sec
Running org.apache.tika.parser.mbox.OutlookPSTParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.094 sec
Running org.apache.tika.parser.jpeg.JpegParserTest
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.153 sec
Running org.apache.tika.parser.executable.ExecutableParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
Running org.apache.tika.parser.rtf.RTFParserTest
Tests run: 31, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.221 sec
Running org.apache.tika.parser.fork.ForkParserIntegrationTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.322 sec
Running org.apache.tika.parser.envi.EnviHeaderParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running org.apache.tika.parser.AutoDetectParserTest
Tests run: 22, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 2.439 sec <<<
FAILURE!
Running org.apache.tika.parser.epub.EpubParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec
Running org.apache.tika.parser.code.SourceCodeParserTest
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.069 sec
Running org.apache.tika.parser.netcdf.NetCDFParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.125 sec
Running org.apache.tika.parser.pdf.PDFParserTest
WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 205317
WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 56931
WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 56931
WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 56931
WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 56931
WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 205317
INFO [main] (PDFParser.java:248) - Document is encrypted
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at
offset 116
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at
offset 5592
WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 51851
WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 51851
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at
offset 116
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at
offset 5592
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at
offset 12324
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at
offset 116
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at
offset 5969
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at
offset 116
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at
offset 5687
WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 44785
WARN [main] (FontManager.java:312) - Font not found: Times New Roman
WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 44785
WARN [main] (FontManager.java:312) - Font not found: Times New Roman
WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 56931
WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 56931
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at
offset 116
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at
offset 26441
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at
offset 116
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at
offset 5592
WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 205317
WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 205317
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at
offset 116
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at
offset 8777
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at
offset 2314576
WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 68229
WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 68229
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at
offset 116
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at
offset 5500
WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 56931
WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 51851
INFO [main] (PDFParser.java:248) - Document is encrypted
INFO [main] (PDFParser.java:248) - Document is encrypted
Tests run: 27, Failures: 3, Errors: 0, Skipped: 0, Time elapsed: 14.305 sec <<<
FAILURE!
Running org.apache.tika.parser.RecursiveParserWrapperTest
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.233 sec
Running org.apache.tika.parser.prt.PRTParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec
Running org.apache.tika.parser.html.HtmlParserTest
Tests run: 38, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0.162 sec
Running org.apache.tika.parser.mat.MatParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.543 sec
Running org.apache.tika.parser.feed.FeedParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.011 sec
Running org.apache.tika.parser.ocr.TesseractOCRTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 3, Time elapsed: 0.007 sec
Running org.apache.tika.parser.odf.ODFParserTest
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.098 sec
Running org.apache.tika.parser.hdf.HDFParserTest
WARN [main] (H4header.java:392) - **dimension length=0 for TagVGroup=
*refno=53 tag= VG (1965) Vgroup length=34 class= Dim0.0 name= Longitude using
data 52
WARN [main] (H4header.java:392) - **dimension length=0 for TagVGroup=
*refno=55 tag= VG (1965) Vgroup length=33 class= Dim0.0 name= Latitude using
data 54
WARN [main] (H4header.java:392) - **dimension length=0 for TagVGroup=
*refno=57 tag= VG (1965) Vgroup length=33 class= Dim0.0 name= fakeDim2 using
data 56
WARN [main] (H4header.java:392) - **dimension length=0 for TagVGroup=
*refno=59 tag= VG (1965) Vgroup length=33 class= Dim0.0 name= fakeDim3 using
data 58
WARN [main] (H4header.java:844) - data tag missing vgroup= 70 Sea Surface
Temperature
WARN [main] (H4header.java:844) - data tag missing vgroup= 73 Number of
Observations per Bin
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.087 sec
Running org.apache.tika.embedder.ExternalEmbedderTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec
Running org.apache.tika.mime.MimeTypesTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
Running org.apache.tika.mime.TestMimeTypes
Tests run: 47, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.163 sec
Running org.apache.tika.mime.MimeTypeTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running org.apache.tika.detect.TestContainerAwareDetector
Tests run: 15, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.277 sec
Running org.apache.tika.TestParsers
WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 68229
WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 44785
WARN [main] (FontManager.java:312) - Font not found: Times New Roman
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.2 sec
Results :
Failed tests: testMultipart(org.apache.tika.parser.mail.RFC822ParserTest):
Exception thrown: TIKA-198: Illegal IOException from
org.apache.tika.parser.ocr.TesseractOCRParser@2657d8a0
testInlineSelector(org.apache.tika.parser.pdf.PDFParserTest): expected:<2>
but was:<0>
testInlineConfig(org.apache.tika.parser.pdf.PDFParserTest): expected:<2> but
was:<0>
testEmbeddedFilesInChildren(org.apache.tika.parser.pdf.PDFParserTest):
expected:<5> but was:<3>
Tests in error:
testUnusualFromAddress(org.apache.tika.parser.mail.RFC822ParserTest):
TIKA-198: Illegal IOException from
org.apache.tika.parser.ocr.TesseractOCRParser@1574a7af
testImages(org.apache.tika.parser.AutoDetectParserTest): TIKA-198: Illegal
IOException from org.apache.tika.parser.ocr.TesseractOCRParser@107aac4a
Tests run: 538, Failures: 4, Errors: 2, Skipped: 4
{noformat}
I tried installing Tesseract from here:
http://pkgs.org/centos-6/naulinux-school-x86_64/tesseract-3.01-2.el6.x86_64.rpm.html
With that package installed, the other tests pass, but the Tesseract tests
themselves fail (I think there is something wrong with the English config and
am looking into it).
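
For reference, here is a rough sketch of the kind of availability check that could guard the OCR-dependent tests. This is not what Tika does today, just an illustration: the class name TesseractCheck, the method canRun, and the "-v" argument are all made up for this example. It only checks whether the binary can be launched at all, so it would not catch the English config problem mentioned above.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Tika: reports whether an external
// command can be started on this machine.
public class TesseractCheck {

    public static boolean canRun(String command) {
        try {
            // "-v" is an assumed harmless argument; only the launch matters here.
            Process p = new ProcessBuilder(command, "-v")
                    .redirectErrorStream(true)
                    .start();
            p.getOutputStream().close();
            // Drain the output so the child process cannot block on a full pipe.
            try (InputStream in = p.getInputStream()) {
                byte[] buf = new byte[1024];
                while (in.read(buf) != -1) {
                    // discard
                }
            }
            p.waitFor();
            return true;
        } catch (IOException e) {
            // ProcessBuilder.start() throws IOException when the program
            // cannot be found or executed.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("tesseract available: " + canRun("tesseract"));
    }
}
```

In a JUnit 4 test, the result could be passed to org.junit.Assume.assumeTrue(...) so that a test is reported as skipped rather than failed when the binary is missing.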