Chris A. Mattmann created TIKA-1421:
---------------------------------------

             Summary: Tika-Parsers tests fail on CentOS6 if tesseract isn't 
installed
                 Key: TIKA-1421
                 URL: https://issues.apache.org/jira/browse/TIKA-1421
             Project: Tika
          Issue Type: Bug
          Components: parser
         Environment: CentOS6 AWS VM for DARPA Memex
            Reporter: Chris A. Mattmann
            Assignee: Chris A. Mattmann
             Fix For: 1.7


While testing TIKA-93 on CentOS6, I ran into some test failing issues on a 
1.7-trunk fresh install of tika in tika-parsers:

{noformat}
Running org.apache.tika.parser.chm.TestChmLzxcControlData
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec
Running org.apache.tika.parser.chm.TestChmBlockInfo
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
Running org.apache.tika.parser.chm.TestChmItsfHeader
Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec
Running org.apache.tika.parser.txt.TXTParserTest
Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.016 sec
Running org.apache.tika.parser.txt.CharsetDetectorTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
Running org.apache.tika.parser.image.xmp.JempboxExtractorTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec
Running org.apache.tika.parser.image.PSDParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
Running org.apache.tika.parser.image.ImageParserTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.034 sec
Running org.apache.tika.parser.image.ImageMetadataExtractorTest
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.241 sec
Running org.apache.tika.parser.image.MetadataFieldsTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running org.apache.tika.parser.image.TiffParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
Running org.apache.tika.parser.font.FontParsersTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.192 sec
Running org.apache.tika.parser.mp4.MP4ParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.07 sec
Running org.apache.tika.parser.mp3.Mp3ParserTest
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.046 sec
Running org.apache.tika.parser.mp3.MpegStreamTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
Running org.apache.tika.parser.dwg.DWGParserTest
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
Running org.apache.tika.parser.pkg.GzipParserTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.252 sec
Running org.apache.tika.parser.pkg.Seven7ParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.37 sec
Running org.apache.tika.parser.pkg.TarParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.118 sec
Running org.apache.tika.parser.pkg.Bzip2ParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.233 sec
Running org.apache.tika.parser.pkg.ArParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.017 sec
Running org.apache.tika.parser.pkg.ZipParserTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.302 sec
Running org.apache.tika.parser.video.FLVParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.026 sec
Running org.apache.tika.parser.solidworks.SolidworksParserTest
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.019 sec
Running org.apache.tika.parser.ibooks.iBooksParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.019 sec
Running org.apache.tika.parser.ParsingReaderTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.018 sec
Running org.apache.tika.parser.mail.RFC822ParserTest
Tests run: 8, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 0.31 sec <<< 
FAILURE!
Running org.apache.tika.parser.mbox.MboxParserTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.026 sec
Running org.apache.tika.parser.mbox.OutlookPSTParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.094 sec
Running org.apache.tika.parser.jpeg.JpegParserTest
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.153 sec
Running org.apache.tika.parser.executable.ExecutableParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
Running org.apache.tika.parser.rtf.RTFParserTest
Tests run: 31, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.221 sec
Running org.apache.tika.parser.fork.ForkParserIntegrationTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.322 sec
Running org.apache.tika.parser.envi.EnviHeaderParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running org.apache.tika.parser.AutoDetectParserTest
Tests run: 22, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 2.439 sec <<< 
FAILURE!
Running org.apache.tika.parser.epub.EpubParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec
Running org.apache.tika.parser.code.SourceCodeParserTest
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.069 sec
Running org.apache.tika.parser.netcdf.NetCDFParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.125 sec
Running org.apache.tika.parser.pdf.PDFParserTest
 WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 205317
 WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 56931
 WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 56931
 WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 56931
 WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 56931
 WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 205317
 INFO [main] (PDFParser.java:248) - Document is encrypted
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at 
offset 116
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at 
offset 5592
 WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 51851
 WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 51851
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at 
offset 116
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at 
offset 5592
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at 
offset 12324
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at 
offset 116
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at 
offset 5969
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at 
offset 116
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at 
offset 5687
 WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 44785
 WARN [main] (FontManager.java:312) - Font not found: Times New Roman
 WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 44785
 WARN [main] (FontManager.java:312) - Font not found: Times New Roman
 WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 56931
 WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 56931
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at 
offset 116
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at 
offset 26441
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at 
offset 116
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at 
offset 5592
 WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 205317
 WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 205317
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at 
offset 116
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at 
offset 8777
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at 
offset 2314576
 WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 68229
 WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 68229
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at 
offset 116
ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at 
offset 5500
 WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 56931
 WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 51851
 INFO [main] (PDFParser.java:248) - Document is encrypted
 INFO [main] (PDFParser.java:248) - Document is encrypted
Tests run: 27, Failures: 3, Errors: 0, Skipped: 0, Time elapsed: 14.305 sec <<< 
FAILURE!
Running org.apache.tika.parser.RecursiveParserWrapperTest
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.233 sec
Running org.apache.tika.parser.prt.PRTParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec
Running org.apache.tika.parser.html.HtmlParserTest
Tests run: 38, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0.162 sec
Running org.apache.tika.parser.mat.MatParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.543 sec
Running org.apache.tika.parser.feed.FeedParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.011 sec
Running org.apache.tika.parser.ocr.TesseractOCRTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 3, Time elapsed: 0.007 sec
Running org.apache.tika.parser.odf.ODFParserTest
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.098 sec
Running org.apache.tika.parser.hdf.HDFParserTest
 WARN [main] (H4header.java:392) - **dimension length=0 for TagVGroup= 
*refno=53 tag= VG (1965) Vgroup length=34 class= Dim0.0 name= Longitude using 
data 52
 WARN [main] (H4header.java:392) - **dimension length=0 for TagVGroup= 
*refno=55 tag= VG (1965) Vgroup length=33 class= Dim0.0 name= Latitude using 
data 54
 WARN [main] (H4header.java:392) - **dimension length=0 for TagVGroup= 
*refno=57 tag= VG (1965) Vgroup length=33 class= Dim0.0 name= fakeDim2 using 
data 56
 WARN [main] (H4header.java:392) - **dimension length=0 for TagVGroup= 
*refno=59 tag= VG (1965) Vgroup length=33 class= Dim0.0 name= fakeDim3 using 
data 58
 WARN [main] (H4header.java:844) - data tag missing vgroup= 70 Sea Surface 
Temperature
 WARN [main] (H4header.java:844) - data tag missing vgroup= 73 Number of 
Observations per Bin
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.087 sec
Running org.apache.tika.embedder.ExternalEmbedderTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec
Running org.apache.tika.mime.MimeTypesTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
Running org.apache.tika.mime.TestMimeTypes
Tests run: 47, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.163 sec
Running org.apache.tika.mime.MimeTypeTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running org.apache.tika.detect.TestContainerAwareDetector
Tests run: 15, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.277 sec
Running org.apache.tika.TestParsers
 WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 68229
 WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 44785
 WARN [main] (FontManager.java:312) - Font not found: Times New Roman
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.2 sec

Results :

Failed tests:   testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): 
Exception thrown: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ocr.TesseractOCRParser@2657d8a0
  testInlineSelector(org.apache.tika.parser.pdf.PDFParserTest): expected:<2> 
but was:<0>
  testInlineConfig(org.apache.tika.parser.pdf.PDFParserTest): expected:<2> but 
was:<0>
  testEmbeddedFilesInChildren(org.apache.tika.parser.pdf.PDFParserTest): 
expected:<5> but was:<3>

Tests in error: 
  testUnusualFromAddress(org.apache.tika.parser.mail.RFC822ParserTest): 
TIKA-198: Illegal IOException from 
org.apache.tika.parser.ocr.TesseractOCRParser@1574a7af
  testImages(org.apache.tika.parser.AutoDetectParserTest): TIKA-198: Illegal 
IOException from org.apache.tika.parser.ocr.TesseractOCRParser@107aac4a

Tests run: 538, Failures: 4, Errors: 2, Skipped: 4
{noformat}

I tried installing Tesseract here:

http://pkgs.org/centos-6/naulinux-school-x86_64/tesseract-3.01-2.el6.x86_64.rpm.html

However, installing that causes the other tests to pass, but the Tesseract ones 
to fail (I think there is something wrong with the English config and am 
looking into it).




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to