Tilman Hausherr created TIKA-4369:
-------------------------------------
Summary: Pages extracted twice
Key: TIKA-4369
URL: https://issues.apache.org/jira/browse/TIKA-4369
Project: Tika
Issue Type: Bug
Components: parser, tika-app
Affects Versions: 3.0.0, 2.9.2, 1.27
Reporter: Tilman Hausherr
Attachments: PDFBOX-4417-001031.pdf, result.htm, result.json, result.txt
Parts of pages 1 and 2 are extracted twice when I run tika-app with default
settings. This isn't new; it also happens with 1.27. The duplicated part starts
with "Improving Generic Drug Review Performance" and appears after the content
of page 4. It doesn't happen with PDFBox extractText.
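
One quick way to confirm where the repeat begins is to search the extracted text for the phrase that opens the duplicated part. The sketch below is only an illustration, not part of Tika or PDFBox: the helper name find_occurrences is hypothetical, and it assumes the plain-text output was saved as result.txt (the name of the attachment). Whitespace between words is matched loosely, since the two extractions may wrap lines differently.

```python
import re
import sys


def find_occurrences(text, phrase):
    """Return the character offsets at which `phrase` occurs in `text`.

    Words of the phrase may be separated by any run of whitespace,
    so a line break inside the phrase still counts as a match.
    """
    words = phrase.split()
    if not words:
        return []
    pattern = r"\s+".join(re.escape(word) for word in words)
    return [match.start() for match in re.finditer(pattern, text)]


if __name__ == "__main__":
    # Assumed default: the text output attached to this issue.
    path = sys.argv[1] if len(sys.argv) > 1 else "result.txt"
    with open(path, encoding="utf-8") as handle:
        extracted = handle.read()

    offsets = find_occurrences(
        extracted, "Improving Generic Drug Review Performance"
    )
    print(f"{len(offsets)} occurrence(s) at offsets {offsets}")
```

Running the same check on the tika-app output and on the PDFBox output should show whether the phrase occurs more often in one than in the other.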
I spent a few hours investigating but didn't find anything. Before I start
digging deeper (e.g. in the PDFBox stripper): is there something obvious
that I missed?