[jira] [Commented] (TIKA-4728) Validate xhtml output, generally

ASF GitHub Bot (Jira) Mon, 18 May 2026 20:18:05 -0700


    [ https://issues.apache.org/jira/browse/TIKA-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081887#comment-18081887 ]


ASF GitHub Bot commented on TIKA-4728:
--------------------------------------

Copilot commented on code in PR #2817:
URL: https://github.com/apache/tika/pull/2817#discussion_r3263442705


##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java:
##########
@@ -672,6 +672,18 @@ void handleGeneralTextContainingPart(String contentType, String xhtmlClassLabel,
                     } catch (IOException | TikaException e) {
                         parentMetadata.add(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING,
                                 ExceptionUtils.getStackTrace(e));
+                    } catch (SAXException e) {
+                        // Don't let a per-part SAX failure cancel the rest of
+                        // the loop 

> Validate xhtml output, generally
> --------------------------------
>
>                 Key: TIKA-4728
>                 URL: https://issues.apache.org/jira/browse/TIKA-4728
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>
> There's a bug in the XML output that we write for JavaScript attached in a 
> particular way in PDFs. We should fix that, but we should also add more 
> general, more robust testing that our XHTML output can actually be parsed.
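
As a rough illustration of the kind of general check the description asks for, the sketch below feeds an XHTML string through the JDK's SAX parser and reports whether it is well-formed. It is not taken from the PR: the class name XhtmlWellFormedCheck, the method isWellFormed, and the sample strings are hypothetical, and it only checks XML well-formedness, not validity against an XHTML schema.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: checks that a string is well-formed XML.
public class XhtmlWellFormedCheck {

    /** Returns true if the input parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();
            DefaultHandler handler = new DefaultHandler() {
                @Override
                public InputSource resolveEntity(String publicId, String systemId) {
                    // Never fetch external DTDs or entities; substitute empty content.
                    return new InputSource(new StringReader(""));
                }
            };
            // DefaultHandler.fatalError rethrows, so malformed input surfaces as SAXException.
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample inputs: one clean document, one with unescaped script content.
        String good = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<body><p>ok</p></body></html>";
        String bad = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<body><script>if (a < b) {}</script></body></html>";
        System.out.println("good: " + isWellFormed(good));
        System.out.println("bad: " + isWellFormed(bad));
    }
}
```

A test along these lines could run over the XHTML produced for each file in an existing test corpus, so that any parser emitting unescaped content fails loudly instead of only being noticed downstream.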



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
