[jira] [Created] (TIKA-4369) Pages extracted twice

Tilman Hausherr (Jira) Thu, 16 Jan 2025 05:57:04 -0800

Tilman Hausherr created TIKA-4369:
-------------------------------------

             Summary: Pages extracted twice
                 Key: TIKA-4369
                 URL: https://issues.apache.org/jira/browse/TIKA-4369
             Project: Tika
          Issue Type: Bug
          Components: parser, tika-app
    Affects Versions: 3.0.0, 2.9.2, 1.27
            Reporter: Tilman Hausherr
         Attachments: PDFBOX-4417-001031.pdf, result.htm, result.json, result.txt


Parts of pages 1 and 2 are extracted twice when I run tika-app with default 
settings. This isn't new; it also happens with 1.27. The duplicate part starts 
with "Improving Generic Drug Review Performance" and appears after the content 
of page 4. It doesn't happen with PDFBox extractText.
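
One quick way to confirm the duplication in a saved text output is to count how 
often the marker phrase appears. The following is only a minimal sketch, not 
part of Tika or PDFBox: the function name find_occurrences and the command-line 
usage are hypothetical, and it assumes the extracted text has been saved as a 
UTF-8 file (for example the attached result.txt).

```python
import re
import sys


def find_occurrences(text: str, phrase: str) -> list[int]:
    """Return the start offset of every occurrence of phrase in text.

    Whitespace runs are collapsed to single spaces first, so line wrapping
    in the extracted output does not hide a match. Offsets refer to the
    whitespace-normalized text, not to the original file.
    """
    normalized = re.sub(r"\s+", " ", text)
    needle = re.sub(r"\s+", " ", phrase.strip())
    return [m.start() for m in re.finditer(re.escape(needle), normalized)]


# Hypothetical usage: python find_dupes.py result.txt
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as handle:
        content = handle.read()
    marker = "Improving Generic Drug Review Performance"
    offsets = find_occurrences(content, marker)
    print(f"{len(offsets)} occurrence(s) at offsets {offsets}")
```

More than one offset for the marker phrase would suggest that the passage was 
emitted more than once, although a phrase can also legitimately repeat in the 
source document, so the result still needs checking against the PDF.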

I did some research for a few hours but didn't find anything. Before I start 
digging deeper (e.g. in the PDFBox stripper), I wonder if there's something 
obvious that I missed?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
