sungwon kim created PDFBOX-5090:
-----------------------------------

             Summary: Missing text extraction under certain conditions starting with apache pdfbox 2.0.18
                 Key: PDFBOX-5090
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5090
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.22, 2.0.21, 2.0.20, 2.0.19, 2.0.18
         Environment: jdk 1.8, apache pdfbox, fontbox 2.0.18~, windows 10
            Reporter: sungwon kim
         Attachments: 128채널심장전기도시스템을위한3차원매핑소프트웨어개발.pdf, 128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG, textstripper_2.0.17_128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG, textstripper_2.0.17_独立財政機関をめぐる論点整理_3p_top.PNG, textstripper_2.0.18_128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG, textstripper_2.0.18_独立財政機関をめぐる論点整理_3p_top.PNG, 独立財政機関をめぐる論点整理.pdf, 独立財政機関をめぐる論点整理_3p_top.PNG

Starting with PDFBox 2.0.18, PDFTextStripper.getText() fails to extract some text under certain conditions. I suspect the missing text is related to either the font type or the font size.

I have attached the text extraction results of version 2.0.17 and version 2.0.18, and the sample data used for the test.

code
{code:java}
PDDocument pdDocument = PDDocument.load(new File(path));
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(pdDocument); // the call whose output is missing text
pdDocument.close();
{code}

dependencies
{code:xml}
<properties>
    <apache.pdfbox.version>2.0.18</apache.pdfbox.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>${apache.pdfbox.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>fontbox</artifactId>
        <version>${apache.pdfbox.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>xmpbox</artifactId>
        <version>${apache.pdfbox.version}</version>
    </dependency>
</dependencies>
{code}
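
To narrow down which text goes missing, the output of getText() under 2.0.17 and 2.0.18 could be compared line by line. The following is only a rough sketch: ExtractionDiff and missingLines are hypothetical names and not part of PDFBox, and it assumes the two extraction results have already been saved as plain strings.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of PDFBox: compares two extraction results.
class ExtractionDiff {

    // Returns the non-blank lines of `before` (e.g. 2.0.17 output) that do not
    // appear anywhere in `after` (e.g. 2.0.18 output), in their original order.
    static List<String> missingLines(String before, String after) {
        Set<String> afterLines = new HashSet<>();
        for (String line : after.split("\\R")) {
            afterLines.add(line.trim());
        }
        List<String> missing = new ArrayList<>();
        for (String line : before.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !afterLines.contains(trimmed)) {
                missing.add(trimmed);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up sample strings standing in for the real extraction results.
        String text2017 = "Title\nAbstract line\nFootnote 1\n";
        String text2018 = "Title\nFootnote 1\n";
        System.out.println(missingLines(text2017, text2018)); // prints [Abstract line]
    }
}
```

The comparison is exact after trimming whitespace, so a line that is only partially extracted in 2.0.18 is also reported as missing.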