[
https://issues.apache.org/jira/browse/TIKA-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081341#comment-18081341
]
ASF GitHub Bot commented on TIKA-4728:
--------------------------------------
Copilot commented on code in PR #2817:
URL: https://github.com/apache/tika/pull/2817#discussion_r3251972338
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java:
##########
@@ -1537,6 +1537,20 @@ public void testMaxPagesInvalidValue() {
config.setMaxPages(1);
}
+ // TIKA-XXXX: handleDestinationOrAction pre-populated class/type on the
action div,
+ // then processJavaScriptAction appended a second class/type for
PDActionJavaScript
+ // actions, producing a div with duplicate attributes that SAX parsers
reject.
+ // TikaTest.getXML wraps with StrictXHTMLValidator, so a regression makes
+ // this test throw at the offending SAX event.
Review Comment:
The added header comment still refers to `TIKA-XXXX`. Please update this to
the actual issue ID (looks like `TIKA-4728` from the PR title) so the
regression test is traceable in release notes / history.
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-miscoffice-module/src/main/java/org/apache/tika/parser/epub/EpubParser.java:
##########
@@ -607,18 +628,44 @@ public void startElement(String uri, String localName,
String name, Attributes a
if (needToRewrite) {
AttributesImpl simplifiedAtts = new AttributesImpl();
for (int i = 0; i < atts.getLength(); i++) {
- simplifiedAtts.addAttribute("", atts.getLocalName(i),
atts.getLocalName(i),
+ String localAttName = atts.getLocalName(i);
+ // Stripping the namespace prefix can collapse two distinct
+ // qnames onto one local name (e.g. xml:lang + lang). The
+ // serialized XHTML must have unique attribute names, so
+ // keep the first occurrence and drop later duplicates.
+ if (simplifiedAtts.getIndex("", localAttName) >= 0) {
+ continue;
+ }
+ simplifiedAtts.addAttribute("", localAttName, localAttName,
atts.getType(i), atts.getValue(i));
}
super.startElement(uri, localName, localName, simplifiedAtts);
} else {
super.startElement(uri, localName, localName, atts);
}
+ openElements.push(localName);
}
@Override
public void endElement(String uri, String localName, String name)
throws SAXException {
super.endElement(uri, localName, localName);
+ if (!openElements.isEmpty()) {
+ openElements.pop();
+ }
+ }
+
+ /**
+ * Emits endElements for every element still on the stack, in reverse
+ * order. Intended ONLY for the catch arm of an EPUB spine item whose
+ * inner parser threw mid-stream: the outer XHTMLContentHandler still
+ * needs balanced events, but the malformed spine left some elements
+ * dangling on the wire.
+ */
+ void drainOpen() throws SAXException {
+ while (!openElements.isEmpty()) {
+ String el = openElements.pop();
+ super.endElement("", el, el);
+ }
Review Comment:
`drainOpen()` emits `endElement` events with an empty namespace URI
(`super.endElement("", el, el)`), even though `startElement` forwards the
original `uri`. Per SAX conventions, endElement's (uri, localName/qName) should
match the corresponding startElement; using a different URI can confuse
downstream handlers/serializers that are namespace-aware. Consider tracking the
URI (and ideally localName/qName) on `openElements` so `drainOpen()` can emit
matching endElement events.
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLTikaBodyPartHandler.java:
##########
@@ -214,6 +220,36 @@ public java.util.Set<String> getEmittedCommentIds() {
return emittedCommentIds;
}
+ /**
+ * Closes any XHTML elements this handler opened but didn't get a chance to
+ * close, in the proper nesting order. Intended ONLY for the catch arm of a
+ * caller that swallowed a {@link SAXException} from the inner SAX parser;
+ * the normal happy-path flow keeps the trackers in sync via endParagraph
+ * / endTableCell / endTableRow / endTable / FormattingTagManager.closeAll.
+ * Without this, swallowed exceptions leave dangling {@code <p>}, {@code
<td>},
+ * {@code <tr>}, {@code <table>}, or formatting tags on the wire that
+ * collide with the outer {@code </body></html>}.
+ */
+ public void closeAnyPending() throws SAXException {
+ formattingTags.closeAll();
+ if (tableCellDepth > 0) {
Review Comment:
`closeAnyPending()` only closes a single `<td>` when `tableCellDepth > 0`,
but `tableCellDepth` is an integer depth counter and can be >1 (e.g., nested
tables inside cells). This cleanup should close *all* pending cells (loop while
`tableCellDepth > 0`) to avoid leaving dangling `<td>` elements after a
swallowed SAXException.
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-cad-module/src/main/java/org/apache/tika/parser/prt/PRTParser.java:
##########
@@ -77,6 +77,7 @@ public void parse(TikaInputStream tis, ContentHandler
handler, Metadata metadata
ParseContext context) throws IOException, SAXException,
TikaException {
XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata,
context);
+ xhtml.startDocument();
Last5 l5 = new Last5();
Review Comment:
`xhtml.startDocument()`/`endDocument()` are not guarded by a try/finally. If
an `IOException`/`SAXException` occurs during header reads or the main loop,
`endDocument()` will be skipped and the XHTML stream will remain unterminated.
Wrap the parsing body in try/finally so `endDocument()` is always called after
a successful `startDocument()`.
> Validate xhtml output, generally
> --------------------------------
>
> Key: TIKA-4728
> URL: https://issues.apache.org/jira/browse/TIKA-4728
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
>
> There's a bug in the xml output that we're writing for specific js attached
> in a specific way in PDFs. We should fix that, but we should add more
> general, more robust testing that we can actually parse our xhtml.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)