[ https://issues.apache.org/jira/browse/TIKA-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081887#comment-18081887 ]
ASF GitHub Bot commented on TIKA-4728:
--------------------------------------
Copilot commented on code in PR #2817:
URL: https://github.com/apache/tika/pull/2817#discussion_r3263442705
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java:
##########
@@ -672,6 +672,18 @@ void handleGeneralTextContainingPart(String contentType, String xhtmlClassLabel,
                 } catch (IOException | TikaException e) {
                     parentMetadata.add(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING,
                             ExceptionUtils.getStackTrace(e));
+                } catch (SAXException e) {
+                    // Don't let a per-part SAX failure cancel the rest of
+                    // the loop
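
For readers without the full file at hand, the pattern in the hunk above (catch the failure for one part, record it, and keep iterating) can be sketched in isolation. This is only an illustration under stated assumptions: `PerPartIsolationSketch`, `PartHandler`, and `processAll` are hypothetical names, not Tika APIs, and the warnings list merely stands in for whatever the real code records on the parent metadata.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.SAXException;

// Hypothetical sketch, not Tika code: isolate a SAX failure to the part
// that caused it so the remaining parts are still processed.
final class PerPartIsolationSketch {

    // Hypothetical callback standing in for the per-part handling logic.
    interface PartHandler {
        void handle(String part) throws SAXException;
    }

    // Runs the handler on every part and returns one warning per failed part.
    static List<String> processAll(List<String> parts, PartHandler handler) {
        List<String> warnings = new ArrayList<>();
        for (String part : parts) {
            try {
                handler.handle(part);
            } catch (SAXException e) {
                // Record the failure and continue with the next part
                // instead of letting the exception end the loop.
                warnings.add(part + ": " + e.getMessage());
            }
        }
        return warnings;
    }
}
```

Whether every SAXException should be handled this way in the real extractor is a separate question that this sketch does not try to answer.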
> Validate xhtml output, generally
> --------------------------------
>
> Key: TIKA-4728
> URL: https://issues.apache.org/jira/browse/TIKA-4728
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
>
> There's a bug in the XML output that we write for certain JavaScript that is
> attached to PDFs in a particular way. We should fix that, but we should also
> add more general, more robust tests that verify we can actually parse our XHTML.
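
One way such a general check could look, as a minimal sketch only: feed the serialized XHTML back through the JDK's SAX parser and treat any fatal parse error as a failure. `XhtmlWellFormedCheck` and `isWellFormed` are hypothetical names rather than existing Tika test utilities, and the sketch assumes the output carries no DOCTYPE that would need resolving.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: checks that a string of XHTML
// output is well-formed XML by parsing it with a plain SAX parser.
final class XhtmlWellFormedCheck {

    static boolean isWellFormed(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            // DefaultHandler ignores content events; as an ErrorHandler its
            // fatalError() rethrows the SAXParseException.
            DefaultHandler handler = new DefaultHandler();
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xhtml)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }
}
```

A test could assert that `isWellFormed` returns true for the output of each parser under test. An unescaped `&&` inside a script element, for instance, makes it return false.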