[jira] [Commented] (TIKA-4369) Pages extracted twice

Tilman Hausherr (Jira) Fri, 17 Jan 2025 09:20:21 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17914187#comment-17914187
 ]


Tilman Hausherr commented on TIKA-4369:
---------------------------------------

Oh, I should have found this. It reminds me of the EU AstraZeneca redaction 
failure, where interesting text was found in the bookmarks.

> Pages extracted twice
> ---------------------
>
>                 Key: TIKA-4369
>                 URL: https://issues.apache.org/jira/browse/TIKA-4369
>             Project: Tika
>          Issue Type: Bug
>          Components: parser, tika-app
>    Affects Versions: 1.27, 2.9.2, 3.0.0
>            Reporter: Tilman Hausherr
>            Priority: Major
>         Attachments: PDFBOX-4417-001031.pdf, result.htm, result.json, 
> result.txt
>
>
> Parts of pages 1 and 2 are extracted twice when I run tika-app with default 
> settings. This isn't new; it also happens with 1.27. The duplicate part 
> starts with "Improving Generic Drug Review Performance", after the content of 
> page 4. It doesn't happen with PDFBox extractText.
> I spent a few hours researching this but didn't find anything. Before I start 
> digging deeper (e.g. in the PDFBox stripper), I wonder if there's something 
> obvious that I missed?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
