[jira] [Commented] (TIKA-4728) Validate xhtml output, generally

ASF GitHub Bot (Jira) Thu, 21 May 2026 06:14:08 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18082585#comment-18082585
 ]


ASF GitHub Bot commented on TIKA-4728:
--------------------------------------

tballison commented on code in PR #2817:
URL: https://github.com/apache/tika/pull/2817#discussion_r3281408563


##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-apple-module/src/main/java/org/apache/tika/parser/iwork/PagesContentHandler.java:
##########
@@ -59,7 +59,9 @@ class PagesContentHandler extends DefaultHandler {
     @Override
     public void endDocument() throws SAXException {
         metadata.set(Office.PAGE_COUNT, String.valueOf(pageCount));
-        if (pageCount > 0) {
+        // Either sf:page-start or sl:page-group opens a <div>; close the
+        // last open one regardless of which counter tracked it.
+        if (pageCount + slPageCount > 0) {
             doFooter();
             xhtml.endElement("div");

Review Comment:
   don't agree. 





> Validate xhtml output, generally
> --------------------------------
>
>                 Key: TIKA-4728
>                 URL: https://issues.apache.org/jira/browse/TIKA-4728
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>
> There's a bug in the xml output that we're writing for specific js attached 
> in a specific way in PDFs. We should fix that, but we should add more 
> general, more robust testing that we can actually parse our xhtml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (TIKA-4728) Validate xhtml output, generally

Reply via email to