Thank you for sharing! Not able to replicate on linux...trying my Windows laptop.
Unrelated...there's something really broken with the xhtml in that there are two bodies. I can replicate this on linux. Will open an issue...

On Thu, Apr 15, 2021 at 10:04 AM Peter Kronenberg <[email protected]> wrote:

> We’re getting a test failure. I don’t see any recent check-ins that would be causing this, so maybe it’s been there for awhile (I don’t always run the tests)
>
> [INFO] Results:
> [INFO]
> [ERROR] Failures:
> [ERROR] TesseractOCRParserTest.testOCROutputsHOCR:105->TikaTest.assertContains:79 <span class="ocrx_word" id="word_1_1" not found in:
> <html xmlns=http://www.w3.org/1999/xhtml>
> <head>
> <meta name="pdf:docinfo:custom:AAPL:Keywords" content="" />
> <meta name="pdf:PDFVersion" content="1.3" />
> <meta name="pdf:docinfo:title" content="Presentation1" />
> <meta name="xmp:CreatorTool" content="PowerPoint" />
> <meta name="pdf:hasXFA" content="false" />
> <meta name="access_permission:modify_annotations" content="true" />
> <meta name="access_permission:can_print_degraded" content="true" />
> <meta name="AAPL:Keywords" content="" />
> <meta name="dc:creator" content="grantingersoll" />
> <meta name="dcterms:created" content="2014-02-08T19:57:12Z" />
> <meta name="dcterms:modified" content="2014-02-08T19:57:12Z" />
> <meta name="dc:format" content="application/pdf; version=1.3" />
> <meta name="pdf:docinfo:creator_tool" content="PowerPoint" />
> <meta name="access_permission:fill_in_form" content="true" />
> <meta name="pdf:docinfo:keywords" content="" />
> <meta name="pdf:docinfo:modified" content="2014-02-08T19:57:12Z" />
> <meta name="pdf:encrypted" content="false" />
> <meta name="dc:title" content="Presentation1" />
> <meta name="cp:subject" content="" />
> <meta name="pdf:docinfo:subject" content="" />
> <meta name="pdf:hasMarkedContent" content="false" />
> <meta name="Content-Type" content="application/pdf" />
> <meta name="pdf:docinfo:creator" content="grantingersoll" />
> <meta name="dc:subject" content="" />
> <meta name="dc:subject" content="" />
> <meta name="dc:subject" content="" />
> <meta name="dc:subject" content="" />
> <meta name="pdf:producer" content="Mac OS X 10.9.1 Quartz PDFContext" />
> <meta name="access_permission:extract_for_accessibility" content="true" />
> <meta name="access_permission:assemble_document" content="true" />
> <meta name="xmpTPg:NPages" content="1" />
> <meta name="pdf:hasXMP" content="false" />
> <meta name="access_permission:extract_content" content="true" />
> <meta name="access_permission:can_print" content="true" />
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser" />
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pdf.PDFParser" />
> <meta name="meta:keyword" content="" />
> <meta name="access_permission:can_modify" content="true" />
> <meta name="pdf:docinfo:producer" content="Mac OS X 10.9.1 Quartz PDFContext" />
> <meta name="pdf:docinfo:created" content="2014-02-08T19:57:12Z" />
> <title>Presentation1</title>
> </head>
> <body><div class="page"><p />
> <img src="embedded:image0.png" alt="image0.png" /></div>
> </body></html><html xmlns=http://www.w3.org/1999/xhtml>
> <head>
> <meta name="Transparency Alpha" content="none" />
> <meta name="tiff:ImageLength" content="261" />
> <meta name="Compression CompressionTypeName" content="deflate" />
> <meta name="Data BitsPerSample" content="8 8 8" />
> <meta name="Data PlanarConfiguration" content="PixelInterleaved" />
> <meta name="Dimension VerticalPixelSize" content="0.35273367" />
> <meta name="IHDR" content="width=934, height=261, bitDepth=8, colorType=RGB, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none" />
> <meta name="embeddedResourceType" content="INLINE" />
> <meta name="Chroma ColorSpaceType" content="RGB" />
> <meta name="tiff:BitsPerSample" content="8 8 8" />
> <meta name="Content-Type" content="image/png" />
> <meta name="height" content="261" />
> <meta name="pHYs" content="pixelsPerUnitXAxis=2835, pixelsPerUnitYAxis=2835, unitSpecifier=meter" />
> <meta name="Dimension PixelAspectRatio" content="1.0" />
> <meta name="resourceName" content="image0.png" />
> <meta name="pdf:hasXMP" content="false" />
> <meta name="Compression NumProgressiveScans" content="1" />
> <meta name="Content-Type-Parser-Override" content="image/ocr-png" />
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser" />
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.image.ImageParser" />
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.ocr.TesseractOCRParser" />
> <meta name="Dimension HorizontalPixelSize" content="0.35273367" />
> <meta name="Chroma BlackIsZero" content="true" />
> <meta name="Compression Lossless" content="true" />
> <meta name="X-TIKA:embedded_depth" content="1" />
> <meta name="width" content="934" />
> <meta name="Dimension ImageOrientation" content="Normal" />
> <meta name="X-TIKA:embedded_resource_path" content="/image0.png" />
> <meta name="tiff:ImageWidth" content="934" />
> <meta name="Chroma NumChannels" content="3" />
> <meta name="Data SampleFormat" content="UnsignedIntegral" />
> <title></title>
> </head>
> <body /></html>
> [INFO]
> [ERROR] Tests run: 305, Failures: 1, Errors: 0, Skipped: 10
> [INFO]
> [INFO] ------------------------------------------------------------------------
> [INFO] Reactor Summary for Apache Tika parent 2.0.0-SNAPSHOT:
> [INFO]
> [INFO] Apache Tika parent ................................. SUCCESS [ 2.952 s]
> [INFO] Apache Tika core ................................... SUCCESS [ 37.037 s]
> [INFO] tika-parsers ....................................... SUCCESS [ 0.225 s]
> [INFO] Apache Tika classic parser modules and package ..... SUCCESS [ 0.500 s]
> [INFO] Apache Tika classic parser modules ................. SUCCESS [ 0.261 s]
> [INFO] tika-parser-html-commons ........................... SUCCESS [ 1.773 s]
> [INFO] tika-parser-digest-commons ......................... SUCCESS [ 0.998 s]
> [INFO] tika-parser-mail-commons ........................... SUCCESS [ 1.627 s]
> [INFO] tika-parser-xmp-commons ............................ SUCCESS [ 2.008 s]
> [INFO] tika-parser-zip-commons ............................ SUCCESS [ 2.405 s]
> [INFO] tika-parser-image-module ........................... SUCCESS [ 4.140 s]
> [INFO] tika-parser-ocr-module ............................. SUCCESS [ 16.227 s]
> [INFO] tika-parser-audiovideo-module ...................... SUCCESS [ 2.998 s]
> [INFO] tika-parser-text-module ............................ SUCCESS [ 3.578 s]
> [INFO] tika-parser-code-module ............................ SUCCESS [ 3.739 s]
> [INFO] tika-parser-html-module ............................ SUCCESS [ 3.842 s]
> [INFO] tika-parser-font-module ............................ SUCCESS [ 2.291 s]
> [INFO] tika-parser-xml-module ............................. SUCCESS [ 2.637 s]
> [INFO] tika-parser-microsoft-module ....................... SUCCESS [ 46.829 s]
> [INFO] tika-parser-pkg-module ............................. SUCCESS [ 3.862 s]
> [INFO] tika-parser-pdf-module ............................. SUCCESS [ 15.538 s]
> [INFO] tika-parser-apple-module ........................... SUCCESS [ 3.497 s]
> [INFO] tika-parser-cad-module ............................. SUCCESS [ 2.195 s]
> [INFO] tika-parser-mail-module ............................ SUCCESS [ 9.893 s]
> [INFO] tika-parser-miscoffice-module ...................... SUCCESS [ 8.474 s]
> [INFO] tika-parser-news-module ............................ SUCCESS [ 1.982 s]
> [INFO] tika-parser-crypto-module .......................... SUCCESS [ 2.624 s]
> [INFO] Apache Tika classic parser package ................. FAILURE [02:15 min]
> [INFO] ------------------------------------------------------------------------
> [INFO] BUILD FAILURE
> [INFO] ------------------------------------------------------------------------
> [INFO] Total time: 05:21 min
> [INFO] Finished at: 2021-04-15T10:00:49-04:00
> [INFO] ------------------------------------------------------------------------
> [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M4:test (default-test) on project tika-parsers-classic-package: There are test failures.
> [ERROR]
> [ERROR] Please refer to C:\tika\tika-parsers\tika-parsers-classic\tika-parsers-classic-package\target\surefire-reports for the individual test results.
> [ERROR] Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
> [ERROR] -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please read the following articles:
> [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the command
> [ERROR] mvn <args> -rf :tika-parsers-classic-package
>
> c:\tika>
>
> Peter Kronenberg | Senior AI Analytic Engineer
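
For anyone who wants to check the "two bodies" observation on their own output, here is a minimal sketch in plain Java. It is not Tika code: the class `XhtmlElementCounter` and the method `countOpenTags` are hypothetical names made up for illustration, and the sample string is a shortened stand-in for the output quoted above. It only counts opening tags with a regular expression, which should be enough to spot a second `<html>` or `<body>` in a serialized result.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: counts opening tags in serialized XHTML.
final class XhtmlElementCounter {

    // Counts opening or self-closing tags with the given name,
    // e.g. <body>, <body class="x"> or <body />. Closing tags are not counted.
    static int countOpenTags(String xhtml, String tagName) {
        Pattern p = Pattern.compile("<" + Pattern.quote(tagName) + "(\\s|/|>)");
        Matcher m = p.matcher(xhtml);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the test output: two documents back to back.
        String output =
                "<html><head></head><body><div class=\"page\"></div></body></html>"
                + "<html><head></head><body /></html>";
        System.out.println("html: " + countOpenTags(output, "html")); // html: 2
        System.out.println("body: " + countOpenTags(output, "body")); // body: 2
    }
}
```

A count above 1 for either tag would suggest that the embedded image's document was written after the container document instead of inside it, which matches what the failure message shows.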
