[
https://issues.apache.org/jira/browse/PDFBOX-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272634#comment-17272634
]
Tilman Hausherr commented on PDFBOX-5090:
-----------------------------------------
You mention "it fails to extract text with any condition", but you attached
images, not text. I tried 2.0.22 on "128채널심장전기도시스템을위한3차원매핑소프트웨어개발.pdf"
and text extraction did work; see the attachment.
[^128채널심장전기도시스템을위한3차원매핑소프트웨어개발.txt]
From your images, it seems you mean a difference in text extraction. Where is
this difference in the source files? On what page and where?
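One way to pinpoint the spot, as a minimal sketch rather than anything PDFBox provides: save the extracted text from 2.0.17 and from 2.0.18 to two files (assumed here to be UTF-8) and report the first line where they diverge. The class name ExtractionDiff and the method firstDifference are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, not part of PDFBox: locates the first line at which
// two text extraction outputs differ.
class ExtractionDiff {

    /** Returns the 0-based index of the first differing line, or -1 if the lists are identical. */
    static int firstDifference(List<String> oldLines, List<String> newLines) {
        int common = Math.min(oldLines.size(), newLines.size());
        for (int i = 0; i < common; i++) {
            if (!oldLines.get(i).equals(newLines.get(i))) {
                return i;
            }
        }
        // One output is a prefix of the other: they diverge where the shorter one ends.
        return oldLines.size() == newLines.size() ? -1 : common;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java ExtractionDiff <old-output.txt> <new-output.txt>");
            return;
        }
        // Both files are assumed to be UTF-8 encoded.
        List<String> oldLines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        List<String> newLines = Files.readAllLines(Paths.get(args[1]), StandardCharsets.UTF_8);

        int index = firstDifference(oldLines, newLines);
        if (index < 0) {
            System.out.println("No difference found.");
            return;
        }
        System.out.println("First difference at line " + (index + 1));
        System.out.println("old: " + (index < oldLines.size() ? oldLines.get(index) : "<missing>"));
        System.out.println("new: " + (index < newLines.size() ? newLines.get(index) : "<missing>"));
    }
}
```

Run it with the two saved outputs as arguments. The reported line can then be matched to a page and position in the source PDF.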
> Missing text extraction under certain conditions starting with apache pdfbox
> 2.0.18
> -----------------------------------------------------------------------------------
>
> Key: PDFBOX-5090
> URL: https://issues.apache.org/jira/browse/PDFBOX-5090
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.18, 2.0.19, 2.0.20, 2.0.21, 2.0.22
> Environment: jdk 1.8, apache pdfbox, fontbox 2.0.18~, windows 10
> Reporter: sungwon kim
> Priority: Major
> Attachments: 128채널심장전기도시스템을위한3차원매핑소프트웨어개발.pdf,
> 128채널심장전기도시스템을위한3차원매핑소프트웨어개발.txt,
> 128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG,
> textstripper_2.0.17_128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG,
> textstripper_2.0.17_独立財政機関をめぐる論点整理_3p_top.PNG,
> textstripper_2.0.18_128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG,
> textstripper_2.0.18_独立財政機関をめぐる論点整理_3p_top.PNG, 独立財政機関をめぐる論点整理.pdf,
> 独立財政機関をめぐる論点整理_3p_top.PNG
>
>
> When calling PDFTextStripper.getText() function on pdfbox 2.0.18 or later, it
> fails to extract text with any condition.
> The missing text appears to be related to the font type, the font size,
> or the width and height of the text.
> I have attached the text extraction results from version 2.0.17 and version
> 2.0.18, along with the sample files used for the test.
> code
>
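> {code:java}
> PDDocument pdDocument = PDDocument.load(new File(path));
> PDFTextStripper stripper = new PDFTextStripper();
> // Extract the text of the whole document, then release the file
> String text = stripper.getText(pdDocument);
> pdDocument.close();
> {code}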
> dependencies
>
> {code:xml}
> <properties>
>   <apache.pdfbox.version>2.0.18</apache.pdfbox.version>
> </properties>
> <dependencies>
>   <dependency>
>     <groupId>org.apache.pdfbox</groupId>
>     <artifactId>pdfbox</artifactId>
>     <version>${apache.pdfbox.version}</version>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.pdfbox</groupId>
>     <artifactId>fontbox</artifactId>
>     <version>${apache.pdfbox.version}</version>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.pdfbox</groupId>
>     <artifactId>xmpbox</artifactId>
>     <version>${apache.pdfbox.version}</version>
>   </dependency>
> </dependencies>
> {code}
>