[
https://issues.apache.org/jira/browse/TIKA-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18082207#comment-18082207
]
ASF GitHub Bot commented on TIKA-4728:
--------------------------------------
Copilot commented on code in PR #2817:
URL: https://github.com/apache/tika/pull/2817#discussion_r3271181241
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-apple-module/src/main/java/org/apache/tika/parser/iwork/PagesContentHandler.java:
##########
@@ -59,7 +59,9 @@ class PagesContentHandler extends DefaultHandler {
@Override
public void endDocument() throws SAXException {
metadata.set(Office.PAGE_COUNT, String.valueOf(pageCount));
- if (pageCount > 0) {
+ // Either sf:page-start or sl:page-group opens a <div>; close the
+ // last open one regardless of which counter tracked it.
+ if (pageCount + slPageCount > 0) {
doFooter();
xhtml.endElement("div");
Review Comment:
`endDocument()` sets `Office.PAGE_COUNT` from `pageCount` only, but the
handler now treats `sl:page-group` as opening/closing page `<div>`s via
`slPageCount`. For documents that use `sl:page-group` without `sf:page-start`,
this will still report a page count of 0 even though page divs were
emitted/closed. Consider setting `Office.PAGE_COUNT` from `pageCount +
slPageCount` (or otherwise aligning the metadata with the counters used for
paging).
##########
tika-core/src/main/java/org/apache/tika/sax/StrictXHTMLValidator.java:
##########
@@ -0,0 +1,227 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.sax;
+
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.HashSet;
+import java.util.Set;
+
+import org.xml.sax.Attributes;
+import org.xml.sax.ContentHandler;
+import org.xml.sax.SAXException;
+
+/**
+ * A SAX content handler decorator that enforces XHTML well-formedness on the
+ * incoming event stream. Any parser that emits an event sequence that would
+ * produce malformed XHTML triggers a {@link SAXException} synchronously — the
+ * stack trace points at the parser code that made the offending call, instead
+ * of surfacing later as a parse error on the serialized output.
+ * <p>
+ * Invariants enforced:
+ * <ul>
+ * <li>{@code startDocument} is called at most once.</li>
+ * <li>No SAX events arrive after {@code endDocument}.</li>
+ * <li>Every {@code endElement} matches the topmost open {@code startElement}
+ * (no cross-nesting like {@code
<a><b></a></b>}).</li>
+ * <li>The element stack is empty when {@code endDocument} fires (no unclosed
+ * elements left dangling by an exception path).</li>
+ * <li>Within a single {@code startElement}, no two attributes share the same
+ * (namespaceURI, localName) pair (the bug class that produces
+ * {@code <div class="x" class="y">}).</li>
+ * </ul>
+ * Use as a decorator wrapping the real handler. It passes every event through
+ * to the downstream handler after validation, so any normal text/XHTML capture
+ * still works.
+ */
+public class StrictXHTMLValidator extends ContentHandlerDecorator {
+
+ private final Deque<QName> openElements = new ArrayDeque<>();
+ private boolean documentStarted;
+ private boolean documentEnded;
+
+ public StrictXHTMLValidator(ContentHandler handler) {
+ super(handler);
+ }
+
+ @Override
+ public void startDocument() throws SAXException {
+ if (documentStarted) {
+ throw new SAXException("StrictXHTMLValidator: startDocument called
twice");
+ }
+ if (documentEnded) {
+ throw new SAXException(
+ "StrictXHTMLValidator: startDocument after endDocument");
+ }
+ documentStarted = true;
+ super.startDocument();
+ }
+
+ @Override
+ public void endDocument() throws SAXException {
+ if (documentEnded) {
+ throw new SAXException("StrictXHTMLValidator: endDocument called
twice");
+ }
+ if (!openElements.isEmpty()) {
+ throw new SAXException(
+ "StrictXHTMLValidator: endDocument with " +
openElements.size()
+ + " unclosed element(s); topmost was <"
+ + openElements.peek().qOrLocal() + ">");
+ }
+ documentEnded = true;
+ super.endDocument();
+ }
+
+ @Override
+ public void startElement(String uri, String localName, String qName,
Attributes attrs)
+ throws SAXException {
+ ensureNotEnded("startElement <" + display(qName, localName) + ">");
+ checkAttributesUnique(qName, localName, attrs);
+ openElements.push(new QName(uri, localName, qName));
+ super.startElement(uri, localName, qName, attrs);
+ }
+
+ @Override
+ public void endElement(String uri, String localName, String qName) throws
SAXException {
+ ensureNotEnded("endElement </" + display(qName, localName) + ">");
+ if (openElements.isEmpty()) {
+ throw new SAXException(
+ "StrictXHTMLValidator: endElement </" + display(qName,
localName)
+ + "> with no matching startElement");
+ }
+ QName top = openElements.pop();
+ if (!top.matches(uri, localName, qName)) {
+ throw new SAXException(
+ "StrictXHTMLValidator: endElement </" + display(qName,
localName)
+ + "> does not match topmost open element <"
+ + top.qOrLocal() + ">");
+ }
+ super.endElement(uri, localName, qName);
+ }
+
+ @Override
+ public void characters(char[] ch, int start, int length) throws
SAXException {
+ ensureNotEnded("characters");
+ super.characters(ch, start, length);
+ }
+
+ @Override
+ public void ignorableWhitespace(char[] ch, int start, int length) throws
SAXException {
+ ensureNotEnded("ignorableWhitespace");
+ super.ignorableWhitespace(ch, start, length);
+ }
+
+ @Override
+ public void processingInstruction(String target, String data) throws
SAXException {
+ ensureNotEnded("processingInstruction");
+ super.processingInstruction(target, data);
+ }
+
+ @Override
+ public void startPrefixMapping(String prefix, String uri) throws
SAXException {
+ ensureNotEnded("startPrefixMapping");
+ super.startPrefixMapping(prefix, uri);
+ }
+
+ @Override
+ public void endPrefixMapping(String prefix) throws SAXException {
+ ensureNotEnded("endPrefixMapping");
+ super.endPrefixMapping(prefix);
+ }
+
+ @Override
+ public void skippedEntity(String name) throws SAXException {
+ ensureNotEnded("skippedEntity");
+ super.skippedEntity(name);
+ }
+
+ private void ensureNotEnded(String event) throws SAXException {
+ if (documentEnded) {
+ throw new SAXException(
+ "StrictXHTMLValidator: " + event + " arrived after
endDocument");
+ }
+ }
+
+ private void checkAttributesUnique(String elementQName, String
elementLocalName,
+ Attributes attrs) throws SAXException {
+ int n = attrs.getLength();
+ if (n < 2) {
+ return;
+ }
+ // (uri, localName) pairs must be unique per the XML namespaces spec.
+ // We also check raw qnames because Tika's serializers emit by qname
and
+ // duplicate qnames produce malformed XHTML even when localnames
differ.
+ Set<String> seenUriLocal = new HashSet<>(n);
+ Set<String> seenQNames = new HashSet<>(n);
+ for (int i = 0; i < n; i++) {
+ String uri = nullSafe(attrs.getURI(i));
+ String local = nullSafe(attrs.getLocalName(i));
+ String qn = nullSafe(attrs.getQName(i));
+ String key = uri + "" + local;
Review Comment:
The string literal used as a separator in `String key = uri + "" + local;`
is currently an embedded control character in the source. That’s easy to
corrupt in editors/diffs and can trip style/tooling. Consider using an explicit
escape/char constant (e.g., `uri + "\u0001" + local` or `uri + '\u0001' +
local`) or a small helper key object.
##########
tika-core/src/main/java/org/apache/tika/extractor/ParsingEmbeddedDocumentExtractor.java:
##########
@@ -178,10 +188,12 @@ public void parseEmbedded(
recordException(e, context);
} finally {
tis.removeCloseShield();
- }
-
- if (outputHtml) {
- handler.endElement(XHTML, "div", "div");
+ // Always close the package-entry div so XHTML output stays
well-formed
+ // even when the inner parse throws (e.g., zip-bomb depth limits).
+ if (outputHtml) {
+ balancer.drainOpenElements();
+ handler.endElement(XHTML, "div", "div");
+ }
Review Comment:
`balancer.drainOpenElements()` is executed unconditionally in the `finally`
block when `outputHtml` is true. This will also close any elements left open on
the happy path, which can mask embedded-parser XHTML bugs that would otherwise
be caught by the new `StrictXHTMLValidator` usage in tests. Consider draining
only when an exception from the embedded parse was caught/swallowed (e.g.,
track a boolean).
> Validate xhtml output, generally
> --------------------------------
>
> Key: TIKA-4728
> URL: https://issues.apache.org/jira/browse/TIKA-4728
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
>
> There's a bug in the xml output that we're writing for specific js attached
> in a specific way in PDFs. We should fix that, but we should add more
> general, more robust testing that we can actually parse our xhtml.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)