Copilot commented on code in PR #2817:
URL: https://github.com/apache/tika/pull/2817#discussion_r3263442705
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java:
##########
@@ -672,6 +672,18 @@ void handleGeneralTextContainingPart(String contentType,
String xhtmlClassLabel,
} catch (IOException | TikaException e) {
parentMetadata.add(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING,
ExceptionUtils.getStackTrace(e));
+ } catch (SAXException e) {
+ // Don't let a per-part SAX failure cancel the rest of
+ // the loop -- and crucially, don't let it propagate
+ // past the matching </div> emitted after this loop.
+ // The contentHandler's internal bookkeeping (e.g.,
+ // OOXMLTikaBodyPartHandler) may still have
<p>/<td>/etc.
+ // open after the partial parse; that cleanup needs a
+ // handle to the body handler, which this generic
method
+ // doesn't have. Tracked for a follow-up.
+ WriteLimitReachedException.throwIfWriteLimitReached(e);
+
parentMetadata.add(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING,
+ ExceptionUtils.getStackTrace(e));
Review Comment:
In this SAXException catch you continue the loop but don’t (and can’t)
repair the downstream ContentHandler’s element stack. If the embedded parse
aborted after emitting <p>/<td>/etc., the later endElement("div") will be
cross-nested or will throw, and the captured XHTML can still end up malformed.
Consider either (a) rethrowing SAXException here to avoid emitting a mismatched
closing </div>, or (b) changing this helper to take a handler that can be
drained/closed on SAXException (e.g., pass an OOXMLTikaBodyPartHandler
reference alongside the ContentHandler, or wrap contentHandler in a decorator
that tracks opens and can auto-close before leaving the catch).
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-cad-module/src/main/java/org/apache/tika/parser/prt/PRTParser.java:
##########
@@ -77,53 +77,61 @@ public void parse(TikaInputStream tis, ContentHandler
handler, Metadata metadata
ParseContext context) throws IOException, SAXException,
TikaException {
XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata,
context);
- Last5 l5 = new Last5();
- int read;
+ xhtml.startDocument();
+ // try/finally so endDocument fires even if a header read, characters
+ // emit, or recursive helper throws
IOException/SAXException/TikaException
+ // mid-parse. Without this the captured XHTML would be left
unterminated.
+ try {
+ Last5 l5 = new Last5();
+ int read;
- // Try to get the creation date, which is YYYYMMDDhhmm
- byte[] header = new byte[30];
- IOUtils.readFully(tis, header);
- byte[] date = new byte[12];
- IOUtils.readFully(tis, date);
+ // Try to get the creation date, which is YYYYMMDDhhmm
+ byte[] header = new byte[30];
+ IOUtils.readFully(tis, header);
+ byte[] date = new byte[12];
+ IOUtils.readFully(tis, date);
- String dateStr = new String(date, US_ASCII);
- if (dateStr.startsWith("19") || dateStr.startsWith("20")) {
- String formattedDate = dateStr.substring(0, 4) + "-" +
dateStr.substring(4, 6) + "-" +
- dateStr.substring(6, 8) + "T" + dateStr.substring(8, 10) +
":" +
- dateStr.substring(10, 12) + ":00";
- metadata.set(TikaCoreProperties.CREATED, formattedDate);
- // TODO Metadata.DATE is used as modified, should it be here?
- metadata.set(TikaCoreProperties.CREATED, formattedDate);
- }
- metadata.set(Metadata.CONTENT_TYPE, PRT_MIME_TYPE);
+ String dateStr = new String(date, US_ASCII);
+ if (dateStr.startsWith("19") || dateStr.startsWith("20")) {
+ String formattedDate = dateStr.substring(0, 4) + "-" +
dateStr.substring(4, 6) + "-" +
+ dateStr.substring(6, 8) + "T" + dateStr.substring(8,
10) + ":" +
+ dateStr.substring(10, 12) + ":00";
+ metadata.set(TikaCoreProperties.CREATED, formattedDate);
+ // TODO Metadata.DATE is used as modified, should it be here?
+ metadata.set(TikaCoreProperties.CREATED, formattedDate);
Review Comment:
These two consecutive calls both set TikaCoreProperties.CREATED to the same
formattedDate value, which is redundant and likely a copy/paste error. If the
intent was to set a separate modified/date property, update the second
assignment accordingly; otherwise remove the duplicate set() to avoid confusion.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]