Ref: https://issues.apache.org/jira/browse/TIKA-715
I'm using Tika-app-1.4 (in server mode) in a stand-alone document processing pipeline and have discovered that much of the XHTML Tika produces is invalid. I subsequently found TIKA-715, which appears to cover exactly this problem. Because of it, I cannot use my preferred XML parsing library to extract metadata and text from the XHTML output. As a workaround, I have tried an HTML parser instead; this works, but requires considerably more resources (CPU time and memory). Is there hope of a fix for this issue in the near future, or should I concentrate on improving my code for working with the HTML format?
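
For illustration, here is a minimal sketch of the kind of fallback described above: try a strict XML parse first and switch to a lenient HTML parser only when that fails. It assumes Python's standard library only; the names `extract` and `_LenientCollector` are hypothetical, and it is not the actual pipeline code or anything provided by Tika.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _LenientCollector(HTMLParser):
    """Hypothetical fallback: gathers <meta name/content> pairs and text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.meta = {}
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <meta ... />.
        if tag == "meta":
            a = dict(attrs)
            if "name" in a and "content" in a:
                self.meta[a["name"]] = a["content"]

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def extract(xhtml):
    """Return (metadata, text, parser_used) for one document string."""
    try:
        # Cheap path: succeeds only when the document is well-formed XML.
        root = ET.fromstring(xhtml)
    except ET.ParseError:
        # Expensive path: tolerate unclosed or mismatched tags.
        collector = _LenientCollector()
        collector.feed(xhtml)
        collector.close()
        return collector.meta, " ".join(collector.chunks), "html"

    meta = {}
    for el in root.iter():
        # Drop any "{namespace}" prefix before comparing the tag name.
        if el.tag.rsplit("}", 1)[-1] == "meta":
            if "name" in el.attrib and "content" in el.attrib:
                meta[el.get("name")] = el.get("content")
    text = " ".join(t.strip() for t in root.itertext() if t.strip())
    return meta, text, "xml"
```

With this arrangement, well-formed documents stay on the cheaper XML path and only the malformed ones pay the cost of the HTML parser; whether that saves enough depends on how many documents are actually invalid.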
