Copilot commented on code in PR #2817:
URL: https://github.com/apache/tika/pull/2817#discussion_r3266826052
##########
tika-core/src/main/java/org/apache/tika/sax/BasicContentHandlerFactory.java:
##########
@@ -42,6 +42,7 @@ public class BasicContentHandlerFactory implements StreamingContentHandlerFactor
private HANDLER_TYPE type = HANDLER_TYPE.MARKDOWN;
private int writeLimit = -1;
private boolean throwOnWriteLimitReached = true;
+ private boolean validateXHTML = false;
private transient ParseContext parseContext;
Review Comment:
The new validateXHTML flag changes handler behavior, but
BasicContentHandlerFactory.equals()/hashCode() still ignore it. As a result, two
factories that differ only in their validation setting compare equal and share
a hash code, so a hash-based collection or cache can hand back the wrong one
and validation ends up unexpectedly enabled or disabled. Consider including
validateXHTML in equals/hashCode.
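A minimal sketch of the suggested change, assuming the factory's identity is defined by the four non-transient fields visible in the diff. `FactorySketch` and its `HandlerType` enum are hypothetical stand-ins, not the real BasicContentHandlerFactory, and the actual class may compare additional fields:

```java
import java.util.Objects;

// Hypothetical stand-in for BasicContentHandlerFactory; only the fields
// visible in the diff are modeled here.
class FactorySketch {
    // Stand-in for HANDLER_TYPE; the values are illustrative only.
    enum HandlerType { TEXT, MARKDOWN }

    private final HandlerType type;
    private final int writeLimit;
    private final boolean throwOnWriteLimitReached;
    private final boolean validateXHTML;

    FactorySketch(HandlerType type, int writeLimit,
                  boolean throwOnWriteLimitReached, boolean validateXHTML) {
        this.type = type;
        this.writeLimit = writeLimit;
        this.throwOnWriteLimitReached = throwOnWriteLimitReached;
        this.validateXHTML = validateXHTML;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (o == null || getClass() != o.getClass()) {
            return false;
        }
        FactorySketch that = (FactorySketch) o;
        // validateXHTML takes part in equality alongside the existing fields.
        return writeLimit == that.writeLimit
                && throwOnWriteLimitReached == that.throwOnWriteLimitReached
                && validateXHTML == that.validateXHTML
                && type == that.type;
    }

    @Override
    public int hashCode() {
        // Hash the same set of fields that equals() compares.
        return Objects.hash(type, writeLimit, throwOnWriteLimitReached, validateXHTML);
    }
}
```

With this in place, two factories that differ only in validateXHTML no longer compare equal, so a cache keyed on the factory keeps them apart.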
##########
tika-core/src/main/java/org/apache/tika/extractor/ParsingEmbeddedDocumentExtractor.java:
##########
@@ -178,10 +178,11 @@ public void parseEmbedded(
recordException(e, context);
} finally {
tis.removeCloseShield();
- }
-
- if (outputHtml) {
- handler.endElement(XHTML, "div", "div");
+ // Always close the package-entry div so XHTML output stays well-formed
+ // even when the inner parse throws (e.g., zip-bomb depth limits).
+ if (outputHtml) {
+ handler.endElement(XHTML, "div", "div");
+ }
Review Comment:
Always closing the outer "package-entry" <div> does not guarantee
well-formed XHTML if the delegated parser throws after emitting one or more
startElement events (e.g., a TikaException mid-stream). In that case this
endElement("div") can arrive while inner elements are still open and will fail
under StrictXHTMLValidator. Consider wrapping the inner handler in
XHTMLBalancingHandler and calling drainOpenElements() before ending the div so
any partially-open inner elements are closed first.
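To illustrate the idea, here is a rough sketch of a balancing wrapper built only on the JDK's org.xml.sax API. `BalancingHandlerSketch` is a hypothetical class written for this comment, not the XHTMLBalancingHandler referred to above, whose actual behavior may differ. It forwards only element and character events and assumes the delegated parser emits properly nested end events up to the point where it throws:

```java
import java.util.ArrayDeque;
import java.util.Deque;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: tracks elements opened through this wrapper so they
// can be closed in reverse order if the inner parse aborts mid-stream.
class BalancingHandlerSketch extends DefaultHandler {
    private final ContentHandler delegate;
    // Each entry holds {uri, localName, qName} of a still-open element.
    private final Deque<String[]> open = new ArrayDeque<>();

    BalancingHandlerSketch(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        open.push(new String[] {uri, localName, qName});
        delegate.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        // Assumes well-nested events: the element being closed is the most
        // recently opened one.
        open.poll();
        delegate.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        delegate.characters(ch, start, length);
    }

    // Emits endElement for every element still open, innermost first.
    void drainOpenElements() throws SAXException {
        while (!open.isEmpty()) {
            String[] e = open.pop();
            delegate.endElement(e[0], e[1], e[2]);
        }
    }
}
```

In parseEmbedded, the wrapper would be passed to the delegated parser in place of the raw handler, and the finally block would call drainOpenElements() before emitting the closing div. Because the outer div is opened on the raw handler, the wrapper never tracks it and the drain leaves it for the explicit endElement call.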