sungwon kim created PDFBOX-5090:
-----------------------------------
Summary: Missing text extraction under certain conditions starting
with apache pdfbox 2.0.18
Key: PDFBOX-5090
URL: https://issues.apache.org/jira/browse/PDFBOX-5090
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 2.0.22, 2.0.21, 2.0.20, 2.0.19, 2.0.18
Environment: JDK 1.8, Apache PDFBox and FontBox 2.0.18 or later, Windows 10
Reporter: sungwon kim
Attachments: 128채널심장전기도시스템을위한3차원매핑소프트웨어개발.pdf,
128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG,
textstripper_2.0.17_128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG,
textstripper_2.0.17_独立財政機関をめぐる論点整理_3p_top.PNG,
textstripper_2.0.18_128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG,
textstripper_2.0.18_独立財政機関をめぐる論点整理_3p_top.PNG, 独立財政機関をめぐる論点整理.pdf,
独立財政機関をめぐる論点整理_3p_top.PNG
When calling PDFTextStripper.getText() with pdfbox 2.0.18 or later, some text
is missing from the extracted output under certain conditions. The same
documents were also extracted with version 2.0.17 for comparison.
I suspect the missing text is related to either the font type or the font
size.
I have attached the text extraction results of version 2.0.17 and version
2.0.18 and the sample data used for the test.
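To make the difference between the two attached results easier to pin down, the extracted text from each version can be compared line by line. The following is only a minimal sketch under stated assumptions: ExtractionDiff and missingLines are hypothetical names, not part of PDFBox, and the strings in main are placeholders rather than real output from the attached PDFs. It assumes that text lost in the newer version shows up as whole lines that are absent.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of PDFBox): finds lines one extraction has and another lacks. */
final class ExtractionDiff {

    /**
     * Returns the non-blank lines of olderText that do not occur in newerText.
     * Lines are compared after trimming, and duplicates are counted, so a line
     * that appears twice in olderText but once in newerText is reported once.
     */
    static List<String> missingLines(String olderText, String newerText) {
        // Count how often each trimmed, non-blank line occurs in the newer output.
        Map<String, Integer> remaining = new HashMap<>();
        for (String line : newerText.split("\\R")) {
            String key = line.trim();
            if (!key.isEmpty()) {
                remaining.merge(key, 1, Integer::sum);
            }
        }
        // Walk the older output and collect lines with no remaining match.
        List<String> missing = new ArrayList<>();
        for (String line : olderText.split("\\R")) {
            String key = line.trim();
            if (key.isEmpty()) {
                continue;
            }
            Integer count = remaining.get(key);
            if (count == null || count == 0) {
                missing.add(key);
            } else {
                remaining.put(key, count - 1);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Placeholder strings only; in practice these would be the getText()
        // results produced with 2.0.17 and with 2.0.18 for the same PDF.
        String from2017 = "Title\nFirst paragraph\nSmall caption";
        String from2018 = "Title\nFirst paragraph";
        System.out.println(missingLines(from2017, from2018)); // prints [Small caption]
    }
}
```

Feeding it the 2.0.17 output as olderText and the 2.0.18 output as newerText should list the lines that disappeared, which may help narrow down which fonts or font sizes are affected.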
code
{code:java}
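PDDocument pdDocument = PDDocument.load(new File(path)); // path: one of the attached PDFs
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(pdDocument); // some text is missing here with 2.0.18 or later
pdDocument.close();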
{code}
dependencies
{code:xml}
<properties>
  <apache.pdfbox.version>2.0.18</apache.pdfbox.version>
</properties>
<dependencies>
  <dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>${apache.pdfbox.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>fontbox</artifactId>
    <version>${apache.pdfbox.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>xmpbox</artifactId>
    <version>${apache.pdfbox.version}</version>
  </dependency>
</dependencies>
{code}