[
https://issues.apache.org/jira/browse/TIKA-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17914187#comment-17914187
]
Tilman Hausherr commented on TIKA-4369:
---------------------------------------
Oh, I should have found this. It reminds me of the EU AstraZeneca redaction
fail where interesting text was found in the bookmarks.
> Pages extracted twice
> ---------------------
>
> Key: TIKA-4369
> URL: https://issues.apache.org/jira/browse/TIKA-4369
> Project: Tika
> Issue Type: Bug
> Components: parser, tika-app
> Affects Versions: 1.27, 2.9.2, 3.0.0
> Reporter: Tilman Hausherr
> Priority: Major
> Attachments: PDFBOX-4417-001031.pdf, result.htm, result.json, result.txt
>
>
> Parts of pages 1 and 2 are extracted twice when I run tika-app with default
> settings. This isn't new; it also happens with 1.27. The duplicate part
> starts with "Improving Generic Drug Review Performance", after the content of
> page 4. It doesn't happen with PDFBox extractText.
> I did some research for a few hours but didn't find anything. Before I start
> digging deeper (e.g. in the PDFBox stripper), I wonder if there's something
> obvious that I missed?
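
One way to locate duplicated passages like the one described above is to scan the extracted plain text (for example the attached result.txt) for long lines that occur more than once. The following is a minimal Python sketch and not part of Tika or PDFBox; the function name find_repeated_lines and the 20-character threshold are assumptions chosen for illustration.

```python
from collections import defaultdict


def find_repeated_lines(text, min_length=20):
    """Return {line: [line numbers]} for lines that occur more than once.

    Hypothetical helper: lines shorter than min_length (an arbitrary
    threshold) are ignored so that short headers and blank lines do
    not produce noise.
    """
    positions = defaultdict(list)
    for number, line in enumerate(text.splitlines(), start=1):
        # Collapse whitespace runs so spacing differences do not hide a match.
        normalized = " ".join(line.split())
        if len(normalized) >= min_length:
            positions[normalized].append(number)
    # Keep only the lines that were seen at more than one position.
    return {line: nums for line, nums in positions.items() if len(nums) > 1}
```

Calling find_repeated_lines on the contents of the extracted text file gives the 1-based line numbers of each repeated line, which shows where in the output the second copy begins. It only reports exact line-level repeats after whitespace normalization, so text that is re-wrapped differently in the second copy would need a fuzzier comparison.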