Or it might be that you have the Python image preprocessing libraries installed (and I don’t)...
Will fix today.

On Thu, May 24, 2018 at 2:55 PM Tim Allison <[email protected]> wrote:

> Y, you're probably running a different version of tesseract than I was
> running and getting different (worse) text out during ocr. I guess we
> could add an or 'dehaystack'?
>
> On Thu, May 24, 2018 at 12:09 PM, Chris Mattmann <[email protected]>
> wrote:
>
>> Tim,
>>
>> Are you seeing this?
>>
>> Results :
>>
>> Failed tests:
>>
>> PDFParserTest.testEmbeddedDocsWithOCROnly:1250->TikaTest.assertContains:103
>> pdf_haystack not found in:
>>
>> <html xmlns="http://www.w3.org/1999/xhtml">
>> <head>
>> <meta name="date" content="2013-05-23T18:30:00Z" />
>> <meta name="cp:revision" content="1" />
>> <meta name="extended-properties:AppVersion" content="14.0000" />
>> <meta name="meta:paragraph-count" content="1" />
>> <meta name="meta:word-count" content="16" />
>> <meta name="extended-properties:Company" content="" />
>> <meta name="Word-Count" content="16" />
>> <meta name="dcterms:created" content="2013-05-23T18:30:00Z" />
>> <meta name="meta:line-count" content="1" />
>> <meta name="Last-Modified" content="2013-05-23T18:30:00Z" />
>> <meta name="dcterms:modified" content="2013-05-23T18:30:00Z" />
>> <meta name="Last-Save-Date" content="2013-05-23T18:30:00Z" />
>> <meta name="meta:character-count" content="96" />
>> <meta name="Template" content="Normal.dotm" />
>> <meta name="Line-Count" content="1" />
>> <meta name="Paragraph-Count" content="1" />
>> <meta name="meta:save-date" content="2013-05-23T18:30:00Z" />
>> <meta name="meta:character-count-with-spaces" content="111" />
>> <meta name="Application-Name" content="Microsoft Office Word" />
>> <meta name="modified" content="2013-05-23T18:30:00Z" />
>> <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.wordprocessingml.document" />
>> <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
>> <meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser" />
>> <meta name="meta:creation-date" content="2013-05-23T18:30:00Z" />
>> <meta name="extended-properties:Application" content="Microsoft Office Word" />
>> <meta name="Creation-Date" content="2013-05-23T18:30:00Z" />
>> <meta name="xmpTPg:NPages" content="1" />
>> <meta name="Character-Count-With-Spaces" content="111" />
>> <meta name="Character Count" content="96" />
>> <meta name="Page-Count" content="1" />
>> <meta name="Revision-Number" content="1" />
>> <meta name="Application-Version" content="14.0000" />
>> <meta name="extended-properties:Template" content="Normal.dotm" />
>> <meta name="publisher" content="" />
>> <meta name="meta:page-count" content="1" />
>> <meta name="dc:publisher" content="" />
>> <title></title>
>> </head>
>> <body><p class="header" />
>> <p class="header" />
>> <p class="header" />
>> <p>Outer_haystack</p>
>> <p>Outer_haystack</p>
>> <p><div class="embedded" id="rId8" />
>> </p>
>> <p>Outer_haystack</p>
>> <p />
>> <p>Outer_haystack</p>
>> <p />
>> <p>Outer_haystack</p>
>> <p><a name="_GoBack" /></p>
>> <p class="footer" />
>> <p class="footer" />
>> <p class="footer" />
>> <p>attached.pdf</p>
>> <div class="page"><div class="ocr">dehayslack dehaystack dehayslack dehaystack dehaystack dehaystack pd'
>>
>> </div>
>> </div>
>> <p class="header" />
>> <p class="header" />
>> <p class="header" />
>> <p>Haystack</p>
>> <p>Needle</p>
>> <p>Haystack</p>
>> <p><a name="_GoBack" /></p>
>> <p class="footer" />
>> <p class="footer" />
>> <p class="footer" />
>> <div source="attachment" class="embedded" id="Test.docx" />
>> </body></html>
>>
>> Tests run: 1009, Failures: 1, Errors: 0, Skipped: 30
>>
>> [INFO] ------------------------------------------------------------------------
>> [INFO] Reactor Summary:
>> [INFO]
>> [INFO] Apache Tika parent ................................. SUCCESS [  1.565 s]
>> [INFO] Apache Tika core ................................... SUCCESS [ 32.977 s]
>> [INFO] Apache Tika parsers ................................ FAILURE [05:52 min]
>> [INFO] Apache Tika XMP .................................... SKIPPED
>> [INFO] Apache Tika serialization .......................... SKIPPED
>> [INFO] Apache Tika batch .................................. SKIPPED
>> [INFO] Apache Tika language detection ..................... SKIPPED
>> [INFO] Apache Tika application ............................ SKIPPED
>> [INFO] Apache Tika OSGi bundle ............................ SKIPPED
>> [INFO] Apache Tika translate .............................. SKIPPED
>> [INFO] Apache Tika server ................................. SKIPPED
>> [INFO] Apache Tika examples ............................... SKIPPED
>> [INFO] Apache Tika Java-7 Components ...................... SKIPPED
>> [INFO] Apache Tika eval ................................... SKIPPED
>> [INFO] Apache Tika Deep Learning (powered by DL4J) ........ SKIPPED
>> [INFO] Apache Tika Natural Language Processing ............ SKIPPED
>> [INFO] Apache Tika ........................................ SKIPPED
>> [INFO] ------------------------------------------------------------------------
>> [INFO] BUILD FAILURE
>> [INFO] ------------------------------------------------------------------------
>> [INFO] Total time: 06:27 min
>> [INFO] Finished at: 2018-05-24T09:04:59-07:00
>> [INFO] Final Memory: 72M/1029M
>> [INFO] ------------------------------------------------------------------------
>> [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (default-test) on project tika-parsers: There are test failures.
>> [ERROR]
>> [ERROR] Please refer to /Users/mattmann/tmp/tika2.0.0/tika-parsers/target/surefire-reports for the individual test results.
>> [ERROR] -> [Help 1]
>> [ERROR]
>> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
>> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>> [ERROR]
>> [ERROR] For more information about the errors and possible solutions, please read the following articles:
>> [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
>> [ERROR]
>> [ERROR] After correcting the problems, you can resume the build with the command
>> [ERROR]   mvn <goals> -rf :tika-parsers
>>
>> Keeps failing for me.
>>
>> nonas:tika2.0.0 mattmann$ java -version
>> java version "1.8.0_144"
>> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>> Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>> nonas:tika2.0.0 mattmann$
>>
>> Any ideas?
>>
>> Cheers,
>> Chris
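
For reference, the "add an or 'dehaystack'" idea from the quoted reply could look roughly like the sketch below. It is only an illustration, not the actual change to PDFParserTest: the class name OcrAssertSketch and the helpers containsAny and assertContainsAny are hypothetical and are not part of TikaTest. The sample string is the OCR text from the failing output above.

```java
import java.util.Arrays;

// Hypothetical sketch: accept any one of several OCR variants instead of a single needle.
public class OcrAssertSketch {

    // True if the text contains at least one of the candidate strings.
    public static boolean containsAny(String text, String... candidates) {
        return Arrays.stream(candidates).anyMatch(text::contains);
    }

    // Throws AssertionError when none of the candidates is present,
    // loosely mirroring the "... not found in:" message from the failing test.
    public static void assertContainsAny(String text, String... candidates) {
        if (!containsAny(text, candidates)) {
            throw new AssertionError(
                    String.join(" | ", candidates) + " not found in:\n" + text);
        }
    }

    public static void main(String[] args) {
        // OCR text copied from the failing run quoted in this thread.
        String ocr = "dehayslack dehaystack dehayslack dehaystack dehaystack dehaystack pd'";

        // Passes if either the original needle or the degraded variant shows up.
        assertContainsAny(ocr, "pdf_haystack", "dehaystack");
        System.out.println("ok");
    }
}
```

This loosens the check only for the OCR-derived text; whether that is acceptable, or whether the test should instead be skipped when the tesseract version differs, is a judgment call for the test's author.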
