[ https://issues.apache.org/jira/browse/TIKA-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18082585#comment-18082585 ]
ASF GitHub Bot commented on TIKA-4728:
--------------------------------------
tballison commented on code in PR #2817:
URL: https://github.com/apache/tika/pull/2817#discussion_r3281408563
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-apple-module/src/main/java/org/apache/tika/parser/iwork/PagesContentHandler.java:
##########
@@ -59,7 +59,9 @@ class PagesContentHandler extends DefaultHandler {
     @Override
     public void endDocument() throws SAXException {
         metadata.set(Office.PAGE_COUNT, String.valueOf(pageCount));
-        if (pageCount > 0) {
+        // Either sf:page-start or sl:page-group opens a <div>; close the
+        // last open one regardless of which counter tracked it.
+        if (pageCount + slPageCount > 0) {
             doFooter();
             xhtml.endElement("div");
Review Comment:
I don't agree.
> Validate xhtml output, generally
> --------------------------------
>
> Key: TIKA-4728
> URL: https://issues.apache.org/jira/browse/TIKA-4728
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
>
> There's a bug in the XML output that we write for certain JavaScript that is
> attached to PDFs in a particular way. We should fix that, but we should also
> add more general, more robust testing to confirm that we can actually parse
> our XHTML.
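
For illustration only, here is a minimal sketch of the kind of check the issue describes: feeding emitted XHTML back through a standard JAXP SAX parser and treating any parse failure as a test failure. The class name XhtmlWellFormedCheck and the method isWellFormed are hypothetical and not part of Tika; the sketch uses only JDK classes and assumes the JDK's built-in parser, which supports the disallow-doctype-decl feature.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: reports whether a string parses as well-formed XML.
public class XhtmlWellFormedCheck {

    static boolean isWellFormed(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            // Reject DOCTYPE declarations so the check never resolves an external DTD.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            // DefaultHandler rethrows fatal errors, so malformed input ends up in the catch block.
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), new DefaultHandler());
            return true;
        } catch (SAXException | ParserConfigurationException | IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Balanced markup: every <div> that was opened is also closed.
        System.out.println(isWellFormed("<html><body><div>page 1</div></body></html>"));
        // Unbalanced markup: the <div> is never closed before </body>.
        System.out.println(isWellFormed("<html><body><div>page 1</body></html>"));
    }
}
```

A check like this only establishes well-formedness (for example, that every opened <div> is closed); it does not validate the output against an XHTML schema.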
--
This message was sent by Atlassian Jira
(v8.20.10#820010)