[jira] [Commented] (TIKA-4728) Validate xhtml output, generally

ASF GitHub Bot (Jira) Tue, 19 May 2026 21:17:07 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18082207#comment-18082207
 ]


ASF GitHub Bot commented on TIKA-4728:
--------------------------------------

Copilot commented on code in PR #2817:
URL: https://github.com/apache/tika/pull/2817#discussion_r3271181241


##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-apple-module/src/main/java/org/apache/tika/parser/iwork/PagesContentHandler.java:
##########
@@ -59,7 +59,9 @@ class PagesContentHandler extends DefaultHandler {
     @Override
     public void endDocument() throws SAXException {
         metadata.set(Office.PAGE_COUNT, String.valueOf(pageCount));
-        if (pageCount > 0) {
+        // Either sf:page-start or sl:page-group opens a <div>; close the
+        // last open one regardless of which counter tracked it.
+        if (pageCount + slPageCount > 0) {
             doFooter();
             xhtml.endElement("div");

Review Comment:
   `endDocument()` sets `Office.PAGE_COUNT` from `pageCount` only, but the 
handler now treats `sl:page-group` as opening/closing page `<div>`s via 
`slPageCount`. For documents that use `sl:page-group` without `sf:page-start`, 
this will still report a page count of 0 even though page divs were 
emitted/closed. Consider setting `Office.PAGE_COUNT` from `pageCount + 
slPageCount` (or otherwise aligning the metadata with the counters used for 
paging).



##########
tika-core/src/main/java/org/apache/tika/sax/StrictXHTMLValidator.java:
##########
@@ -0,0 +1,227 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.sax;
+
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.HashSet;
+import java.util.Set;
+
+import org.xml.sax.Attributes;
+import org.xml.sax.ContentHandler;
+import org.xml.sax.SAXException;
+
+/**
+ * A SAX content handler decorator that enforces XHTML well-formedness on the
+ * incoming event stream. Any parser that emits an event sequence that would
+ * produce malformed XHTML triggers a {@link SAXException} synchronously — the
+ * stack trace points at the parser code that made the offending call, instead
+ * of surfacing later as a parse error on the serialized output.
+ * <p>
+ * Invariants enforced:
+ * <ul>
+ *   <li>{@code startDocument} is called at most once.</li>
+ *   <li>No SAX events arrive after {@code endDocument}.</li>
+ *   <li>Every {@code endElement} matches the topmost open {@code startElement}
+ *       (no cross-nesting like {@code 
&lt;a&gt;&lt;b&gt;&lt;/a&gt;&lt;/b&gt;}).</li>
+ *   <li>The element stack is empty when {@code endDocument} fires (no unclosed
+ *       elements left dangling by an exception path).</li>
+ *   <li>Within a single {@code startElement}, no two attributes share the same
+ *       (namespaceURI, localName) pair (the bug class that produces
+ *       {@code &lt;div class="x" class="y"&gt;}).</li>
+ * </ul>
+ * Use as a decorator wrapping the real handler. It passes every event through
+ * to the downstream handler after validation, so any normal text/XHTML capture
+ * still works.
+ */
+public class StrictXHTMLValidator extends ContentHandlerDecorator {
+
+    private final Deque<QName> openElements = new ArrayDeque<>();
+    private boolean documentStarted;
+    private boolean documentEnded;
+
+    public StrictXHTMLValidator(ContentHandler handler) {
+        super(handler);
+    }
+
+    @Override
+    public void startDocument() throws SAXException {
+        if (documentStarted) {
+            throw new SAXException("StrictXHTMLValidator: startDocument called 
twice");
+        }
+        if (documentEnded) {
+            throw new SAXException(
+                    "StrictXHTMLValidator: startDocument after endDocument");
+        }
+        documentStarted = true;
+        super.startDocument();
+    }
+
+    @Override
+    public void endDocument() throws SAXException {
+        if (documentEnded) {
+            throw new SAXException("StrictXHTMLValidator: endDocument called 
twice");
+        }
+        if (!openElements.isEmpty()) {
+            throw new SAXException(
+                    "StrictXHTMLValidator: endDocument with " + 
openElements.size()
+                            + " unclosed element(s); topmost was <"
+                            + openElements.peek().qOrLocal() + ">");
+        }
+        documentEnded = true;
+        super.endDocument();
+    }
+
+    @Override
+    public void startElement(String uri, String localName, String qName, 
Attributes attrs)
+            throws SAXException {
+        ensureNotEnded("startElement <" + display(qName, localName) + ">");
+        checkAttributesUnique(qName, localName, attrs);
+        openElements.push(new QName(uri, localName, qName));
+        super.startElement(uri, localName, qName, attrs);
+    }
+
+    @Override
+    public void endElement(String uri, String localName, String qName) throws 
SAXException {
+        ensureNotEnded("endElement </" + display(qName, localName) + ">");
+        if (openElements.isEmpty()) {
+            throw new SAXException(
+                    "StrictXHTMLValidator: endElement </" + display(qName, 
localName)
+                            + "> with no matching startElement");
+        }
+        QName top = openElements.pop();
+        if (!top.matches(uri, localName, qName)) {
+            throw new SAXException(
+                    "StrictXHTMLValidator: endElement </" + display(qName, 
localName)
+                            + "> does not match topmost open element <"
+                            + top.qOrLocal() + ">");
+        }
+        super.endElement(uri, localName, qName);
+    }
+
+    @Override
+    public void characters(char[] ch, int start, int length) throws 
SAXException {
+        ensureNotEnded("characters");
+        super.characters(ch, start, length);
+    }
+
+    @Override
+    public void ignorableWhitespace(char[] ch, int start, int length) throws 
SAXException {
+        ensureNotEnded("ignorableWhitespace");
+        super.ignorableWhitespace(ch, start, length);
+    }
+
+    @Override
+    public void processingInstruction(String target, String data) throws 
SAXException {
+        ensureNotEnded("processingInstruction");
+        super.processingInstruction(target, data);
+    }
+
+    @Override
+    public void startPrefixMapping(String prefix, String uri) throws 
SAXException {
+        ensureNotEnded("startPrefixMapping");
+        super.startPrefixMapping(prefix, uri);
+    }
+
+    @Override
+    public void endPrefixMapping(String prefix) throws SAXException {
+        ensureNotEnded("endPrefixMapping");
+        super.endPrefixMapping(prefix);
+    }
+
+    @Override
+    public void skippedEntity(String name) throws SAXException {
+        ensureNotEnded("skippedEntity");
+        super.skippedEntity(name);
+    }
+
+    private void ensureNotEnded(String event) throws SAXException {
+        if (documentEnded) {
+            throw new SAXException(
+                    "StrictXHTMLValidator: " + event + " arrived after 
endDocument");
+        }
+    }
+
+    private void checkAttributesUnique(String elementQName, String 
elementLocalName,
+                                       Attributes attrs) throws SAXException {
+        int n = attrs.getLength();
+        if (n < 2) {
+            return;
+        }
+        // (uri, localName) pairs must be unique per the XML namespaces spec.
+        // We also check raw qnames because Tika's serializers emit by qname 
and
+        // duplicate qnames produce malformed XHTML even when localnames 
differ.
+        Set<String> seenUriLocal = new HashSet<>(n);
+        Set<String> seenQNames = new HashSet<>(n);
+        for (int i = 0; i < n; i++) {
+            String uri = nullSafe(attrs.getURI(i));
+            String local = nullSafe(attrs.getLocalName(i));
+            String qn = nullSafe(attrs.getQName(i));
+            String key = uri + "" + local;

Review Comment:
   The string literal used as a separator in `String key = uri + "" + local;` 
is currently an embedded control character in the source. That’s easy to 
corrupt in editors/diffs and can trip style/tooling. Consider using an explicit 
escape/char constant (e.g., `uri + "\u0001" + local` or `uri + '\u0001' + 
local`) or a small helper key object.
   



##########
tika-core/src/main/java/org/apache/tika/extractor/ParsingEmbeddedDocumentExtractor.java:
##########
@@ -178,10 +188,12 @@ public void parseEmbedded(
             recordException(e, context);
         } finally {
             tis.removeCloseShield();
-        }
-
-        if (outputHtml) {
-            handler.endElement(XHTML, "div", "div");
+            // Always close the package-entry div so XHTML output stays 
well-formed
+            // even when the inner parse throws (e.g., zip-bomb depth limits).
+            if (outputHtml) {
+                balancer.drainOpenElements();
+                handler.endElement(XHTML, "div", "div");
+            }

Review Comment:
   `balancer.drainOpenElements()` is executed unconditionally in the `finally` 
block when `outputHtml` is true. This will also close any elements left open on 
the happy path, which can mask embedded-parser XHTML bugs that would otherwise 
be caught by the new `StrictXHTMLValidator` usage in tests. Consider draining 
only when an exception from the embedded parse was caught/swallowed (e.g., 
track a boolean).





> Validate xhtml output, generally
> --------------------------------
>
>                 Key: TIKA-4728
>                 URL: https://issues.apache.org/jira/browse/TIKA-4728
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>
> There's a bug in the xml output that we're writing for specific js attached 
> in a specific way in PDFs. We should fix that, but we should add more 
> general, more robust testing that we can actually parse our xhtml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (TIKA-4728) Validate xhtml output, generally

Reply via email to