Tika 715 (invalid xhtml output)

Raymond Wiker Wed, 11 Dec 2013 03:34:22 -0800

Ref: https://issues.apache.org/jira/browse/TIKA-715


I'm using tika-app 1.4 (in server mode) in a stand-alone document
processing pipeline, and have discovered that a lot of the XHTML output
from Tika is invalid. I subsequently found TIKA-715, which appears to cover
exactly this.

Because of this issue, I cannot use my preferred XML parsing library to
extract metadata and text from the XHTML output. As a workaround, I have
tried using an HTML parser instead; this works, but requires far more
resources (CPU time and memory).
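
A minimal sketch of this kind of workaround, assuming Python and its
standard-library html.parser module; it is not the pipeline's actual code.
The names TikaOutputParser and extract are hypothetical, and the sketch
assumes metadata arrives as <meta name="..." content="..."/> elements with
the document text inside <body>.

```python
from html.parser import HTMLParser


class TikaOutputParser(HTMLParser):
    """Collect <meta> name/content pairs and body text (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.metadata = {}
        self.text_parts = []
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta .../>.
        if tag == "meta":
            attr = dict(attrs)
            if "name" in attr:
                self.metadata[attr["name"]] = attr.get("content", "")
        elif tag == "body":
            self._in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False

    def handle_data(self, data):
        # Keep only non-blank text that appears inside <body>.
        if self._in_body and data.strip():
            self.text_parts.append(data.strip())


def extract(markup):
    """Return (metadata dict, body text) without requiring valid XML."""
    parser = TikaOutputParser()
    parser.feed(markup)
    parser.close()
    return parser.metadata, " ".join(parser.text_parts)
```

Because the parser is event-based and does not build a tree, an unclosed
element in the input does not raise an error as it would in a strict XML
parser.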

Is there hope for a fix for this issue in the near future, or should I just
concentrate on improving my code for working with the HTML format?
