sungwon kim created PDFBOX-5090:
-----------------------------------

             Summary: Missing text extraction under certain conditions starting 
with apache pdfbox 2.0.18
                 Key: PDFBOX-5090
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5090
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.22, 2.0.21, 2.0.20, 2.0.19, 2.0.18
         Environment: jdk 1.8, apache pdfbox, fontbox 2.0.18~, windows 10
            Reporter: sungwon kim
         Attachments: 128채널심장전기도시스템을위한3차원매핑소프트웨어개발.pdf, 
128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG, 
textstripper_2.0.17_128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG, 
textstripper_2.0.17_独立財政機関をめぐる論点整理_3p_top.PNG, 
textstripper_2.0.18_128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG, 
textstripper_2.0.18_独立財政機関をめぐる論点整理_3p_top.PNG, 独立財政機関をめぐる論点整理.pdf, 
独立財政機関をめぐる論点整理_3p_top.PNG

When calling PDFTextStripper.getText() on PDFBox 2.0.18 or later, some of the 
text is not extracted under certain conditions.

I suspect that the missing text is related to either the font type or the 
font size.

 I have attached the text extraction results of version 2.0.17 and version 
2.0.18 and the sample data used for the test.
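To make the comparison between the two versions easier to reproduce, the following is a minimal sketch of a helper that lists the lines present in one extraction result but absent from another. It is not part of PDFBox: the class name ExtractionDiff and the method missingLines are hypothetical, and it assumes the two getText() results have already been obtained as strings, for example by running the same document against 2.0.17 and 2.0.18 separately.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of PDFBox: compares two text extraction results.
final class ExtractionDiff {

    private ExtractionDiff() {
    }

    // Returns the non-blank lines of `baseline` (for example the 2.0.17 output)
    // that do not occur anywhere in `candidate` (for example the 2.0.18 output).
    // Lines are trimmed before comparison; the result keeps the baseline order.
    static List<String> missingLines(String baseline, String candidate) {
        Set<String> seen = new HashSet<>();
        for (String line : candidate.split("\\R")) {
            seen.add(line.trim());
        }
        List<String> missing = new ArrayList<>();
        for (String line : baseline.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !seen.contains(trimmed)) {
                missing.add(trimmed);
            }
        }
        return missing;
    }
}
```

Passing the 2.0.17 output as the baseline and the 2.0.18 output as the candidate would list the lines that the newer version dropped. This is a line-level comparison only, so text that moved to a different line or was partially extracted will also be reported.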

code

 
{code:java}
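try (PDDocument pdDocument = PDDocument.load(new File(path))) { // closes the document when done
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(pdDocument); // some text is missing here on 2.0.18+
}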
{code}
dependencies

 
{code:xml}
<properties>
    <apache.pdfbox.version>2.0.18</apache.pdfbox.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>${apache.pdfbox.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>fontbox</artifactId>
        <version>${apache.pdfbox.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>xmpbox</artifactId>
        <version>${apache.pdfbox.version}</version>
    </dependency>
</dependencies>
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

