[ https://issues.apache.org/jira/browse/TIKA-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17913775#comment-17913775 ]
Tim Allison edited comment on TIKA-4369 at 1/16/25 3:18 PM:
------------------------------------------------------------

Thank you [~tilman], let me take a look before you dig further into it.

was (Author: talli...@mitre.org):
Thank you [~tilman], let me take a look before you dig into it.

> Pages extracted twice
> ---------------------
>
>                 Key: TIKA-4369
>                 URL: https://issues.apache.org/jira/browse/TIKA-4369
>             Project: Tika
>          Issue Type: Bug
>          Components: parser, tika-app
>    Affects Versions: 1.27, 2.9.2, 3.0.0
>            Reporter: Tilman Hausherr
>            Priority: Major
>         Attachments: PDFBOX-4417-001031.pdf, result.htm, result.json, result.txt
>
>
> Parts of pages 1 and 2 are extracted twice when I run tika-app with default
> settings. This isn't new, it also happens with 1.27. The duplicate part
> starts with "Improving Generic Drug Review Performance", after the content of
> page 4. It doesn't happen with PDFBox extractText.
> I did some research for a few hours but didn't find anything. Before I start
> digging deeper (e.g. in the PDFBox stripper), I wonder if there's something
> obvious that I missed?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)