Copilot commented on code in PR #2817:
URL: https://github.com/apache/tika/pull/2817#discussion_r3251972338


##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java:
##########
@@ -1537,6 +1537,20 @@ public void testMaxPagesInvalidValue() {
         config.setMaxPages(1);
     }
 
+    // TIKA-XXXX: handleDestinationOrAction pre-populated class/type on the 
action div,
+    // then processJavaScriptAction appended a second class/type for 
PDActionJavaScript
+    // actions, producing a div with duplicate attributes that SAX parsers 
reject.
+    // TikaTest.getXML wraps with StrictXHTMLValidator, so a regression makes
+    // this test throw at the offending SAX event.

Review Comment:
   The added header comment still refers to `TIKA-XXXX`. Please update this to 
the actual issue ID (looks like `TIKA-4728` from the PR title) so the 
regression test is traceable in release notes / history.



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-miscoffice-module/src/main/java/org/apache/tika/parser/epub/EpubParser.java:
##########
@@ -607,18 +628,44 @@ public void startElement(String uri, String localName, 
String name, Attributes a
             if (needToRewrite) {
                 AttributesImpl simplifiedAtts = new AttributesImpl();
                 for (int i = 0; i < atts.getLength(); i++) {
-                    simplifiedAtts.addAttribute("", atts.getLocalName(i), 
atts.getLocalName(i),
+                    String localAttName = atts.getLocalName(i);
+                    // Stripping the namespace prefix can collapse two distinct
+                    // qnames onto one local name (e.g. xml:lang + lang). The
+                    // serialized XHTML must have unique attribute names, so
+                    // keep the first occurrence and drop later duplicates.
+                    if (simplifiedAtts.getIndex("", localAttName) >= 0) {
+                        continue;
+                    }
+                    simplifiedAtts.addAttribute("", localAttName, localAttName,
                             atts.getType(i), atts.getValue(i));
                 }
                 super.startElement(uri, localName, localName, simplifiedAtts);
             } else {
                 super.startElement(uri, localName, localName, atts);
             }
+            openElements.push(localName);
         }
 
         @Override
         public void endElement(String uri, String localName, String name) 
throws SAXException {
             super.endElement(uri, localName, localName);
+            if (!openElements.isEmpty()) {
+                openElements.pop();
+            }
+        }
+
+        /**
+         * Emits endElements for every element still on the stack, in reverse
+         * order. Intended ONLY for the catch arm of an EPUB spine item whose
+         * inner parser threw mid-stream: the outer XHTMLContentHandler still
+         * needs balanced events, but the malformed spine left some elements
+         * dangling on the wire.
+         */
+        void drainOpen() throws SAXException {
+            while (!openElements.isEmpty()) {
+                String el = openElements.pop();
+                super.endElement("", el, el);
+            }

Review Comment:
   `drainOpen()` emits `endElement` events with an empty namespace URI 
(`super.endElement("", el, el)`), even though `startElement` forwards the 
original `uri`. Per SAX conventions, endElement's (uri, localName/qName) should 
match the corresponding startElement; using a different URI can confuse 
downstream handlers/serializers that are namespace-aware. Consider tracking the 
URI (and ideally localName/qName) on `openElements` so `drainOpen()` can emit 
matching endElement events.



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLTikaBodyPartHandler.java:
##########
@@ -214,6 +220,36 @@ public java.util.Set<String> getEmittedCommentIds() {
         return emittedCommentIds;
     }
 
+    /**
+     * Closes any XHTML elements this handler opened but didn't get a chance to
+     * close, in the proper nesting order. Intended ONLY for the catch arm of a
+     * caller that swallowed a {@link SAXException} from the inner SAX parser;
+     * the normal happy-path flow keeps the trackers in sync via endParagraph
+     * / endTableCell / endTableRow / endTable / FormattingTagManager.closeAll.
+     * Without this, swallowed exceptions leave dangling {@code <p>}, {@code 
<td>},
+     * {@code <tr>}, {@code <table>}, or formatting tags on the wire that
+     * collide with the outer {@code </body></html>}.
+     */
+    public void closeAnyPending() throws SAXException {
+        formattingTags.closeAll();
+        if (tableCellDepth > 0) {

Review Comment:
   `closeAnyPending()` only closes a single `<td>` when `tableCellDepth > 0`, 
but `tableCellDepth` is an integer depth counter and can be >1 (e.g., nested 
tables inside cells). This cleanup should close *all* pending cells (loop while 
`tableCellDepth > 0`) to avoid leaving dangling `<td>` elements after a 
swallowed SAXException.
   



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-cad-module/src/main/java/org/apache/tika/parser/prt/PRTParser.java:
##########
@@ -77,6 +77,7 @@ public void parse(TikaInputStream tis, ContentHandler 
handler, Metadata metadata
                       ParseContext context) throws IOException, SAXException, 
TikaException {
 
         XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata, 
context);
+        xhtml.startDocument();
         Last5 l5 = new Last5();

Review Comment:
   `xhtml.startDocument()`/`endDocument()` are not guarded by a try/finally. If 
an `IOException`/`SAXException` occurs during header reads or the main loop, 
`endDocument()` will be skipped and the XHTML stream will remain unterminated. 
Wrap the parsing body in try/finally so `endDocument()` is always called after 
a successful `startDocument()`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to