Or it might be that you have the Python image preprocessing libraries
installed (and I don’t)...

Will fix today.
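
For reference, the "or 'dehaystack'" idea from the quoted message below could look roughly like this. It is only a sketch and not the actual change to PDFParserTest: the class name OcrTolerantAssert and both helper methods are hypothetical (they are not part of TikaTest), and the sample string is the OCR output copied from the failure log.

```java
import java.util.Arrays;

public class OcrTolerantAssert {

    /** Returns true if the text contains at least one of the candidate needles. */
    public static boolean containsAny(String text, String... needles) {
        return Arrays.stream(needles).anyMatch(text::contains);
    }

    /** Throws an AssertionError if none of the candidate needles occurs in the text. */
    public static void assertContainsAny(String text, String... needles) {
        if (!containsAny(text, needles)) {
            throw new AssertionError(
                    "none of " + Arrays.toString(needles) + " found in:\n" + text);
        }
    }

    public static void main(String[] args) {
        // OCR text as it appears in the failing run quoted below.
        String ocr = "dehayslack dehaystack dehayslack dehaystack dehaystack dehaystack pd'";

        // Accept either the clean token or the degraded one that this
        // Tesseract setup produced.
        assertContainsAny(ocr, "pdf_haystack", "dehaystack");
        System.out.println("ok");
    }
}
```

The trade-off is that a looser assertion hides real OCR regressions, so it probably only makes sense for tests whose output depends on the locally installed Tesseract.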

On Thu, May 24, 2018 at 2:55 PM Tim Allison <[email protected]> wrote:

> Yes, you're probably running a different version of Tesseract than I was,
> and getting different (worse) text out of OCR.  I guess we could change
> the assertion to also accept 'dehaystack'?
>
> On Thu, May 24, 2018 at 12:09 PM, Chris Mattmann <[email protected]>
> wrote:
>
>> Tim,
>>
>>
>>
>> Are you seeing this?
>>
>>
>>
>> Results :
>>
>>
>>
>> Failed tests:
>>
>>
>> PDFParserTest.testEmbeddedDocsWithOCROnly:1250->TikaTest.assertContains:103
>> pdf_haystack not found in:
>>
>> <html xmlns="http://www.w3.org/1999/xhtml">
>>
>> <head>
>>
>> <meta name="date" content="2013-05-23T18:30:00Z" />
>>
>> <meta name="cp:revision" content="1" />
>>
>> <meta name="extended-properties:AppVersion" content="14.0000" />
>>
>> <meta name="meta:paragraph-count" content="1" />
>>
>> <meta name="meta:word-count" content="16" />
>>
>> <meta name="extended-properties:Company" content="" />
>>
>> <meta name="Word-Count" content="16" />
>>
>> <meta name="dcterms:created" content="2013-05-23T18:30:00Z" />
>>
>> <meta name="meta:line-count" content="1" />
>>
>> <meta name="Last-Modified" content="2013-05-23T18:30:00Z" />
>>
>> <meta name="dcterms:modified" content="2013-05-23T18:30:00Z" />
>>
>> <meta name="Last-Save-Date" content="2013-05-23T18:30:00Z" />
>>
>> <meta name="meta:character-count" content="96" />
>>
>> <meta name="Template" content="Normal.dotm" />
>>
>> <meta name="Line-Count" content="1" />
>>
>> <meta name="Paragraph-Count" content="1" />
>>
>> <meta name="meta:save-date" content="2013-05-23T18:30:00Z" />
>>
>> <meta name="meta:character-count-with-spaces" content="111" />
>>
>> <meta name="Application-Name" content="Microsoft Office Word" />
>>
>> <meta name="modified" content="2013-05-23T18:30:00Z" />
>>
>> <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.wordprocessingml.document" />
>>
>> <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
>>
>> <meta name="X-Parsed-By"
>> content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser" />
>>
>> <meta name="meta:creation-date" content="2013-05-23T18:30:00Z" />
>>
>> <meta name="extended-properties:Application" content="Microsoft Office
>> Word" />
>>
>> <meta name="Creation-Date" content="2013-05-23T18:30:00Z" />
>>
>> <meta name="xmpTPg:NPages" content="1" />
>>
>> <meta name="Character-Count-With-Spaces" content="111" />
>>
>> <meta name="Character Count" content="96" />
>>
>> <meta name="Page-Count" content="1" />
>>
>> <meta name="Revision-Number" content="1" />
>>
>> <meta name="Application-Version" content="14.0000" />
>>
>> <meta name="extended-properties:Template" content="Normal.dotm" />
>>
>> <meta name="publisher" content="" />
>>
>> <meta name="meta:page-count" content="1" />
>>
>> <meta name="dc:publisher" content="" />
>>
>> <title></title>
>>
>> </head>
>>
>> <body><p class="header" />
>>
>> <p class="header" />
>>
>> <p class="header" />
>>
>> <p>Outer_haystack</p>
>>
>> <p>Outer_haystack</p>
>>
>> <p><div class="embedded" id="rId8" />
>>
>> </p>
>>
>> <p>Outer_haystack</p>
>>
>> <p />
>>
>> <p>Outer_haystack</p>
>>
>> <p />
>>
>> <p>Outer_haystack</p>
>>
>> <p><a name="_GoBack" /></p>
>>
>> <p class="footer" />
>>
>> <p class="footer" />
>>
>> <p class="footer" />
>>
>> <p>attached.pdf</p>
>>
>> <div class="page"><div class="ocr">dehayslack dehaystack dehayslack
>> dehaystack dehaystack dehaystack pd'
>>
>>
>>
>> </div>
>>
>> </div>
>>
>> <p class="header" />
>>
>>
>>
>> <p class="header" />
>>
>>
>>
>> <p class="header" />
>>
>>
>>
>> <p>Haystack</p>
>>
>>
>>
>> <p>Needle</p>
>>
>>
>>
>> <p>Haystack</p>
>>
>>
>>
>> <p><a name="_GoBack" /></p>
>>
>>
>>
>> <p class="footer" />
>>
>>
>>
>> <p class="footer" />
>>
>>
>>
>> <p class="footer" />
>>
>>
>>
>> <div source="attachment" class="embedded" id="Test.docx" />
>>
>> </body></html>
>>
>>
>>
>> Tests run: 1009, Failures: 1, Errors: 0, Skipped: 30
>>
>>
>>
>> [INFO]
>> ------------------------------------------------------------------------
>>
>> [INFO] Reactor Summary:
>>
>> [INFO]
>>
>> [INFO] Apache Tika parent ................................. SUCCESS [  1.565 s]
>>
>> [INFO] Apache Tika core ................................... SUCCESS [ 32.977 s]
>>
>> [INFO] Apache Tika parsers ................................ FAILURE [05:52 min]
>>
>> [INFO] Apache Tika XMP .................................... SKIPPED
>>
>> [INFO] Apache Tika serialization .......................... SKIPPED
>>
>> [INFO] Apache Tika batch .................................. SKIPPED
>>
>> [INFO] Apache Tika language detection ..................... SKIPPED
>>
>> [INFO] Apache Tika application ............................ SKIPPED
>>
>> [INFO] Apache Tika OSGi bundle ............................ SKIPPED
>>
>> [INFO] Apache Tika translate .............................. SKIPPED
>>
>> [INFO] Apache Tika server ................................. SKIPPED
>>
>> [INFO] Apache Tika examples ............................... SKIPPED
>>
>> [INFO] Apache Tika Java-7 Components ...................... SKIPPED
>>
>> [INFO] Apache Tika eval ................................... SKIPPED
>>
>> [INFO] Apache Tika Deep Learning (powered by DL4J) ........ SKIPPED
>>
>> [INFO] Apache Tika Natural Language Processing ............ SKIPPED
>>
>> [INFO] Apache Tika ........................................ SKIPPED
>>
>> [INFO]
>> ------------------------------------------------------------------------
>>
>> [INFO] BUILD FAILURE
>>
>> [INFO]
>> ------------------------------------------------------------------------
>>
>> [INFO] Total time: 06:27 min
>>
>> [INFO] Finished at: 2018-05-24T09:04:59-07:00
>>
>> [INFO] Final Memory: 72M/1029M
>>
>> [INFO]
>> ------------------------------------------------------------------------
>>
>> [ERROR] Failed to execute goal
>> org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (default-test)
>> on project tika-parsers: There are test failures.
>>
>> [ERROR]
>>
>> [ERROR] Please refer to
>> /Users/mattmann/tmp/tika2.0.0/tika-parsers/target/surefire-reports for the
>> individual test results.
>>
>> [ERROR] -> [Help 1]
>>
>> [ERROR]
>>
>> [ERROR] To see the full stack trace of the errors, re-run Maven with the
>> -e switch.
>>
>> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>>
>> [ERROR]
>>
>> [ERROR] For more information about the errors and possible solutions,
>> please read the following articles:
>>
>> [ERROR] [Help 1]
>> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
>>
>> [ERROR]
>>
>> [ERROR] After correcting the problems, you can resume the build with the
>> command
>>
>> [ERROR]   mvn <goals> -rf :tika-parsers
>>
>>
>>
>> Keeps failing for me.
>>
>> nonas:tika2.0.0 mattmann$ java -version
>>
>> java version "1.8.0_144"
>>
>> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>>
>> Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>>
>> nonas:tika2.0.0 mattmann$
>>
>>
>>
>> Any ideas?
>>
>>
>>
>> Cheers,
>>
>> Chris
>>
>>
>>
>>
>
