[jira] [Commented] (TIKA-4728) Validate xhtml output, generally

ASF GitHub Bot (Jira) Fri, 15 May 2026 20:58:10 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081341#comment-18081341
 ]


ASF GitHub Bot commented on TIKA-4728:
--------------------------------------

Copilot commented on code in PR #2817:
URL: https://github.com/apache/tika/pull/2817#discussion_r3251972338


##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java:
##########
@@ -1537,6 +1537,20 @@ public void testMaxPagesInvalidValue() {
         config.setMaxPages(1);
     }
 
+    // TIKA-XXXX: handleDestinationOrAction pre-populated class/type on the 
action div,
+    // then processJavaScriptAction appended a second class/type for 
PDActionJavaScript
+    // actions, producing a div with duplicate attributes that SAX parsers 
reject.
+    // TikaTest.getXML wraps with StrictXHTMLValidator, so a regression makes
+    // this test throw at the offending SAX event.

Review Comment:
   The added header comment still refers to `TIKA-XXXX`. Please update this to 
the actual issue ID (looks like `TIKA-4728` from the PR title) so the 
regression test is traceable in release notes / history.



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-miscoffice-module/src/main/java/org/apache/tika/parser/epub/EpubParser.java:
##########
@@ -607,18 +628,44 @@ public void startElement(String uri, String localName, 
String name, Attributes a
             if (needToRewrite) {
                 AttributesImpl simplifiedAtts = new AttributesImpl();
                 for (int i = 0; i < atts.getLength(); i++) {
-                    simplifiedAtts.addAttribute("", atts.getLocalName(i), 
atts.getLocalName(i),
+                    String localAttName = atts.getLocalName(i);
+                    // Stripping the namespace prefix can collapse two distinct
+                    // qnames onto one local name (e.g. xml:lang + lang). The
+                    // serialized XHTML must have unique attribute names, so
+                    // keep the first occurrence and drop later duplicates.
+                    if (simplifiedAtts.getIndex("", localAttName) >= 0) {
+                        continue;
+                    }
+                    simplifiedAtts.addAttribute("", localAttName, localAttName,
                             atts.getType(i), atts.getValue(i));
                 }
                 super.startElement(uri, localName, localName, simplifiedAtts);
             } else {
                 super.startElement(uri, localName, localName, atts);
             }
+            openElements.push(localName);
         }
 
         @Override
         public void endElement(String uri, String localName, String name) 
throws SAXException {
             super.endElement(uri, localName, localName);
+            if (!openElements.isEmpty()) {
+                openElements.pop();
+            }
+        }
+
+        /**
+         * Emits endElements for every element still on the stack, in reverse
+         * order. Intended ONLY for the catch arm of an EPUB spine item whose
+         * inner parser threw mid-stream: the outer XHTMLContentHandler still
+         * needs balanced events, but the malformed spine left some elements
+         * dangling on the wire.
+         */
+        void drainOpen() throws SAXException {
+            while (!openElements.isEmpty()) {
+                String el = openElements.pop();
+                super.endElement("", el, el);
+            }

Review Comment:
   `drainOpen()` emits `endElement` events with an empty namespace URI 
(`super.endElement("", el, el)`), even though `startElement` forwards the 
original `uri`. Per SAX conventions, endElement's (uri, localName/qName) should 
match the corresponding startElement; using a different URI can confuse 
downstream handlers/serializers that are namespace-aware. Consider tracking the 
URI (and ideally localName/qName) on `openElements` so `drainOpen()` can emit 
matching endElement events.



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLTikaBodyPartHandler.java:
##########
@@ -214,6 +220,36 @@ public java.util.Set<String> getEmittedCommentIds() {
         return emittedCommentIds;
     }
 
+    /**
+     * Closes any XHTML elements this handler opened but didn't get a chance to
+     * close, in the proper nesting order. Intended ONLY for the catch arm of a
+     * caller that swallowed a {@link SAXException} from the inner SAX parser;
+     * the normal happy-path flow keeps the trackers in sync via endParagraph
+     * / endTableCell / endTableRow / endTable / FormattingTagManager.closeAll.
+     * Without this, swallowed exceptions leave dangling {@code <p>}, {@code 
<td>},
+     * {@code <tr>}, {@code <table>}, or formatting tags on the wire that
+     * collide with the outer {@code </body></html>}.
+     */
+    public void closeAnyPending() throws SAXException {
+        formattingTags.closeAll();
+        if (tableCellDepth > 0) {

Review Comment:
   `closeAnyPending()` only closes a single `<td>` when `tableCellDepth > 0`, 
but `tableCellDepth` is an integer depth counter and can be >1 (e.g., nested 
tables inside cells). This cleanup should close *all* pending cells (loop while 
`tableCellDepth > 0`) to avoid leaving dangling `<td>` elements after a 
swallowed SAXException.
   



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-cad-module/src/main/java/org/apache/tika/parser/prt/PRTParser.java:
##########
@@ -77,6 +77,7 @@ public void parse(TikaInputStream tis, ContentHandler 
handler, Metadata metadata
                       ParseContext context) throws IOException, SAXException, 
TikaException {
 
         XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata, 
context);
+        xhtml.startDocument();
         Last5 l5 = new Last5();

Review Comment:
   `xhtml.startDocument()`/`endDocument()` are not guarded by a try/finally. If 
an `IOException`/`SAXException` occurs during header reads or the main loop, 
`endDocument()` will be skipped and the XHTML stream will remain unterminated. 
Wrap the parsing body in try/finally so `endDocument()` is always called after 
a successful `startDocument()`.





> Validate xhtml output, generally
> --------------------------------
>
>                 Key: TIKA-4728
>                 URL: https://issues.apache.org/jira/browse/TIKA-4728
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>
> There's a bug in the xml output that we're writing for specific js attached 
> in a specific way in PDFs. We should fix that, but we should add more 
> general, more robust testing that we can actually parse our xhtml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (TIKA-4728) Validate xhtml output, generally

Reply via email to