Copilot commented on code in PR #2817:
URL: https://github.com/apache/tika/pull/2817#discussion_r3271181241
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-apple-module/src/main/java/org/apache/tika/parser/iwork/PagesContentHandler.java:
##########
@@ -59,7 +59,9 @@ class PagesContentHandler extends DefaultHandler {
@Override
public void endDocument() throws SAXException {
metadata.set(Office.PAGE_COUNT, String.valueOf(pageCount));
- if (pageCount > 0) {
+ // Either sf:page-start or sl:page-group opens a <div>; close the
+ // last open one regardless of which counter tracked it.
+ if (pageCount + slPageCount > 0) {
doFooter();
xhtml.endElement("div");
Review Comment:
`endDocument()` sets `Office.PAGE_COUNT` from `pageCount` only, but the
handler now treats `sl:page-group` as opening/closing page `<div>`s via
`slPageCount`. For documents that use `sl:page-group` without `sf:page-start`,
this will still report a page count of 0 even though page divs were
emitted/closed. Consider setting `Office.PAGE_COUNT` from `pageCount +
slPageCount` (or otherwise aligning the metadata with the counters used for
paging).
##########
tika-core/src/main/java/org/apache/tika/sax/StrictXHTMLValidator.java:
##########
@@ -0,0 +1,227 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.sax;
+
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.HashSet;
+import java.util.Set;
+
+import org.xml.sax.Attributes;
+import org.xml.sax.ContentHandler;
+import org.xml.sax.SAXException;
+
+/**
+ * A SAX content handler decorator that enforces XHTML well-formedness on the
+ * incoming event stream. Any parser that emits an event sequence that would
+ * produce malformed XHTML triggers a {@link SAXException} synchronously — the
+ * stack trace points at the parser code that made the offending call, instead
+ * of surfacing later as a parse error on the serialized output.
+ * <p>
+ * Invariants enforced:
+ * <ul>
+ * <li>{@code startDocument} is called at most once.</li>
+ * <li>No SAX events arrive after {@code endDocument}.</li>
+ * <li>Every {@code endElement} matches the topmost open {@code startElement}
+ * (no cross-nesting like {@code
<a><b></a></b>}).</li>
+ * <li>The element stack is empty when {@code endDocument} fires (no unclosed
+ * elements left dangling by an exception path).</li>
+ * <li>Within a single {@code startElement}, no two attributes share the same
+ * (namespaceURI, localName) pair (the bug class that produces
+ * {@code <div class="x" class="y">}).</li>
+ * </ul>
+ * Use as a decorator wrapping the real handler. It passes every event through
+ * to the downstream handler after validation, so any normal text/XHTML capture
+ * still works.
+ */
+public class StrictXHTMLValidator extends ContentHandlerDecorator {
+
+ private final Deque<QName> openElements = new ArrayDeque<>();
+ private boolean documentStarted;
+ private boolean documentEnded;
+
+ public StrictXHTMLValidator(ContentHandler handler) {
+ super(handler);
+ }
+
+ @Override
+ public void startDocument() throws SAXException {
+ if (documentStarted) {
+ throw new SAXException("StrictXHTMLValidator: startDocument called
twice");
+ }
+ if (documentEnded) {
+ throw new SAXException(
+ "StrictXHTMLValidator: startDocument after endDocument");
+ }
+ documentStarted = true;
+ super.startDocument();
+ }
+
+ @Override
+ public void endDocument() throws SAXException {
+ if (documentEnded) {
+ throw new SAXException("StrictXHTMLValidator: endDocument called
twice");
+ }
+ if (!openElements.isEmpty()) {
+ throw new SAXException(
+ "StrictXHTMLValidator: endDocument with " +
openElements.size()
+ + " unclosed element(s); topmost was <"
+ + openElements.peek().qOrLocal() + ">");
+ }
+ documentEnded = true;
+ super.endDocument();
+ }
+
+ @Override
+ public void startElement(String uri, String localName, String qName,
Attributes attrs)
+ throws SAXException {
+ ensureNotEnded("startElement <" + display(qName, localName) + ">");
+ checkAttributesUnique(qName, localName, attrs);
+ openElements.push(new QName(uri, localName, qName));
+ super.startElement(uri, localName, qName, attrs);
+ }
+
+ @Override
+ public void endElement(String uri, String localName, String qName) throws
SAXException {
+ ensureNotEnded("endElement </" + display(qName, localName) + ">");
+ if (openElements.isEmpty()) {
+ throw new SAXException(
+ "StrictXHTMLValidator: endElement </" + display(qName,
localName)
+ + "> with no matching startElement");
+ }
+ QName top = openElements.pop();
+ if (!top.matches(uri, localName, qName)) {
+ throw new SAXException(
+ "StrictXHTMLValidator: endElement </" + display(qName,
localName)
+ + "> does not match topmost open element <"
+ + top.qOrLocal() + ">");
+ }
+ super.endElement(uri, localName, qName);
+ }
+
+ @Override
+ public void characters(char[] ch, int start, int length) throws
SAXException {
+ ensureNotEnded("characters");
+ super.characters(ch, start, length);
+ }
+
+ @Override
+ public void ignorableWhitespace(char[] ch, int start, int length) throws
SAXException {
+ ensureNotEnded("ignorableWhitespace");
+ super.ignorableWhitespace(ch, start, length);
+ }
+
+ @Override
+ public void processingInstruction(String target, String data) throws
SAXException {
+ ensureNotEnded("processingInstruction");
+ super.processingInstruction(target, data);
+ }
+
+ @Override
+ public void startPrefixMapping(String prefix, String uri) throws
SAXException {
+ ensureNotEnded("startPrefixMapping");
+ super.startPrefixMapping(prefix, uri);
+ }
+
+ @Override
+ public void endPrefixMapping(String prefix) throws SAXException {
+ ensureNotEnded("endPrefixMapping");
+ super.endPrefixMapping(prefix);
+ }
+
+ @Override
+ public void skippedEntity(String name) throws SAXException {
+ ensureNotEnded("skippedEntity");
+ super.skippedEntity(name);
+ }
+
+ private void ensureNotEnded(String event) throws SAXException {
+ if (documentEnded) {
+ throw new SAXException(
+ "StrictXHTMLValidator: " + event + " arrived after
endDocument");
+ }
+ }
+
+ private void checkAttributesUnique(String elementQName, String
elementLocalName,
+ Attributes attrs) throws SAXException {
+ int n = attrs.getLength();
+ if (n < 2) {
+ return;
+ }
+ // (uri, localName) pairs must be unique per the XML namespaces spec.
+ // We also check raw qnames because Tika's serializers emit by qname
and
+ // duplicate qnames produce malformed XHTML even when localnames
differ.
+ Set<String> seenUriLocal = new HashSet<>(n);
+ Set<String> seenQNames = new HashSet<>(n);
+ for (int i = 0; i < n; i++) {
+ String uri = nullSafe(attrs.getURI(i));
+ String local = nullSafe(attrs.getLocalName(i));
+ String qn = nullSafe(attrs.getQName(i));
+ String key = uri + "" + local;
Review Comment:
The string literal used as a separator in `String key = uri + "" + local;`
is currently an embedded control character in the source. That’s easy to
corrupt in editors/diffs and can trip style/tooling. Consider using an explicit
escape/char constant (e.g., `uri + "\u0001" + local` or `uri + '\u0001' +
local`) or a small helper key object.
##########
tika-core/src/main/java/org/apache/tika/extractor/ParsingEmbeddedDocumentExtractor.java:
##########
@@ -178,10 +188,12 @@ public void parseEmbedded(
recordException(e, context);
} finally {
tis.removeCloseShield();
- }
-
- if (outputHtml) {
- handler.endElement(XHTML, "div", "div");
+ // Always close the package-entry div so XHTML output stays
well-formed
+ // even when the inner parse throws (e.g., zip-bomb depth limits).
+ if (outputHtml) {
+ balancer.drainOpenElements();
+ handler.endElement(XHTML, "div", "div");
+ }
Review Comment:
`balancer.drainOpenElements()` is executed unconditionally in the `finally`
block when `outputHtml` is true. This will also close any elements left open on
the happy path, which can mask embedded-parser XHTML bugs that would otherwise
be caught by the new `StrictXHTMLValidator` usage in tests. Consider draining
only when an exception from the embedded parse was caught/swallowed (e.g.,
track a boolean).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]