Copilot commented on code in PR #2817:
URL: https://github.com/apache/tika/pull/2817#discussion_r3266826052


##########
tika-core/src/main/java/org/apache/tika/sax/BasicContentHandlerFactory.java:
##########
@@ -42,6 +42,7 @@ public class BasicContentHandlerFactory implements StreamingContentHandlerFactor
     private HANDLER_TYPE type = HANDLER_TYPE.MARKDOWN;
     private int writeLimit = -1;
     private boolean throwOnWriteLimitReached = true;
+    private boolean validateXHTML = false;
     private transient ParseContext parseContext;

Review Comment:
   The new validateXHTML flag changes handler behavior, but 
BasicContentHandlerFactory.equals() and hashCode() still ignore it. Two 
factories that differ only in their validation setting therefore compare equal 
and share a hash code, so a hash-based collection or cache can return the wrong 
one and leave validation unexpectedly enabled or disabled. Consider including 
validateXHTML in both equals() and hashCode().
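
   A minimal sketch of the suggestion, not the actual Tika class: 
HandlerFactorySketch and its HandlerType enum are hypothetical stand-ins that 
model only the fields visible in the hunk above. It shows the flag taking part 
in both equals() and hashCode(), so the two methods stay consistent.

```java
import java.util.Objects;

// Hypothetical stand-in for the factory; only the fields from the diff are modeled.
final class HandlerFactorySketch {
    // Assumed enum values, for illustration only.
    enum HandlerType { TEXT, HTML, XML, MARKDOWN }

    private final HandlerType type;
    private final int writeLimit;
    private final boolean throwOnWriteLimitReached;
    private final boolean validateXHTML;

    HandlerFactorySketch(HandlerType type, int writeLimit,
                         boolean throwOnWriteLimitReached, boolean validateXHTML) {
        this.type = type;
        this.writeLimit = writeLimit;
        this.throwOnWriteLimitReached = throwOnWriteLimitReached;
        this.validateXHTML = validateXHTML;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof HandlerFactorySketch)) {
            return false;
        }
        HandlerFactorySketch other = (HandlerFactorySketch) o;
        // validateXHTML is compared like every other behavior-affecting field.
        return type == other.type
                && writeLimit == other.writeLimit
                && throwOnWriteLimitReached == other.throwOnWriteLimitReached
                && validateXHTML == other.validateXHTML;
    }

    @Override
    public int hashCode() {
        // Hash exactly the fields that equals() compares.
        return Objects.hash(type, writeLimit, throwOnWriteLimitReached, validateXHTML);
    }
}
```

   With this shape, two instances that differ only in validateXHTML are no 
longer interchangeable as map keys or cache entries.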



##########
tika-core/src/main/java/org/apache/tika/extractor/ParsingEmbeddedDocumentExtractor.java:
##########
@@ -178,10 +178,11 @@ public void parseEmbedded(
             recordException(e, context);
         } finally {
             tis.removeCloseShield();
-        }
-
-        if (outputHtml) {
-            handler.endElement(XHTML, "div", "div");
+            // Always close the package-entry div so XHTML output stays well-formed
+            // even when the inner parse throws (e.g., zip-bomb depth limits).
+            if (outputHtml) {
+                handler.endElement(XHTML, "div", "div");
+            }

Review Comment:
   Always closing the outer "package-entry" <div> does not guarantee 
well-formed XHTML if the delegated parser throws after emitting one or more 
startElement events (e.g., a TikaException mid-stream). In that case this 
endElement("div") can arrive while inner elements are still open and will fail 
under StrictXHTMLValidator. Consider wrapping the inner handler in 
XHTMLBalancingHandler and calling drainOpenElements() before ending the div so 
any partially-open inner elements are closed first.
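
   The idea can be sketched with plain SAX types. BalancingHandlerSketch below 
is a hypothetical illustration and not Tika's XHTMLBalancingHandler, whose 
actual API is not shown in this thread; only the drainOpenElements() name is 
borrowed from the comment above. The sketch assumes the inner parser's own 
end events are correctly nested and forwards only element and character 
events.

```java
import java.util.ArrayDeque;
import java.util.Deque;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical decorator that remembers which elements are still open.
class BalancingHandlerSketch extends DefaultHandler {
    private final ContentHandler delegate;
    // Each entry holds {uri, localName, qName} of an element not yet closed.
    private final Deque<String[]> open = new ArrayDeque<>();

    BalancingHandlerSketch(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        delegate.startElement(uri, localName, qName, atts);
        // Record the element only after the delegate accepted it.
        open.push(new String[] {uri, localName, qName});
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (!open.isEmpty()) {
            open.pop();
        }
        delegate.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        delegate.characters(ch, start, length);
    }

    // Closes every element still open, innermost first.
    void drainOpenElements() throws SAXException {
        while (!open.isEmpty()) {
            String[] e = open.pop();
            delegate.endElement(e[0], e[1], e[2]);
        }
    }
}
```

   In parseEmbedded, the delegated parser would receive such a wrapper instead 
of the raw handler, and the finally block would call drainOpenElements() on it 
before emitting the closing div.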



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

