[ 
https://issues.apache.org/jira/browse/PDFBOX-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sungwon kim updated PDFBOX-5090:
--------------------------------
    Description: 
When calling PDFTextStripper.getText() with PDFBox 2.0.18 or later, some text 
is missing from the extracted output under certain conditions.

The missing text appears to be related to the font type, the font size, or the 
width and height of the text.

 I have attached the text extraction results of version 2.0.17 and version 
2.0.18 and the sample data used for the test.

code

 
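{code:java}
PDDocument pdDocument = PDDocument.load(new File(path));
PDFTextStripper stripper = new PDFTextStripper();
// Extract the text; this is the call whose output is incomplete in 2.0.18 and later
String text = stripper.getText(pdDocument);
{code}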
dependencies

 
{code:xml}
<properties>
    <apache.pdfbox.version>2.0.18</apache.pdfbox.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>${apache.pdfbox.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>fontbox</artifactId>
        <version>${apache.pdfbox.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>xmpbox</artifactId>
        <version>${apache.pdfbox.version}</version>
    </dependency>
</dependencies>
{code}
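
To narrow down which text goes missing, the extraction results of the two versions can be compared line by line. The following is only a minimal sketch under stated assumptions: the class ExtractionDiff and the method missingLines are hypothetical helpers (not part of PDFBox), and each result is assumed to be available as a plain string.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of PDFBox: lists lines that one extraction
// result contains and another one lacks.
final class ExtractionDiff {

    // Returns the non-blank lines of `before` (trimmed) that do not occur
    // anywhere in `after`, in their original order.
    static List<String> missingLines(String before, String after) {
        Set<String> present = new HashSet<>();
        for (String line : after.split("\\R")) {
            present.add(line.trim());
        }
        List<String> missing = new ArrayList<>();
        for (String line : before.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !present.contains(trimmed)) {
                missing.add(trimmed);
            }
        }
        return missing;
    }
}
```

Passing the getText() output from 2.0.17 as `before` and the output from 2.0.18 as `after` would list the lines that the newer version no longer produces. The comparison works on whole lines, so a line that lost only part of its text is reported in full.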
 

  was:
When calling PDFTextStripper.getText() with PDFBox 2.0.18 or later, some text 
is missing from the extracted output under certain conditions.

The missing text appears to be related to the font type or the font size.

 I have attached the text extraction results of version 2.0.17 and version 
2.0.18 and the sample data used for the test.

code

 
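{code:java}
PDDocument pdDocument = PDDocument.load(new File(path));
PDFTextStripper stripper = new PDFTextStripper();
// Extract the text; this is the call whose output is incomplete in 2.0.18 and later
String text = stripper.getText(pdDocument);
{code}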
dependencies

 
{code:xml}
<properties>
    <apache.pdfbox.version>2.0.18</apache.pdfbox.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>${apache.pdfbox.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>fontbox</artifactId>
        <version>${apache.pdfbox.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>xmpbox</artifactId>
        <version>${apache.pdfbox.version}</version>
    </dependency>
</dependencies>
{code}
 


> Missing text extraction under certain conditions starting with apache pdfbox 
> 2.0.18
> -----------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5090
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5090
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.18, 2.0.19, 2.0.20, 2.0.21, 2.0.22
>         Environment: JDK 1.8, Apache PDFBox and FontBox 2.0.18 or later, Windows 10
>            Reporter: sungwon kim
>            Priority: Major
>         Attachments: 128채널심장전기도시스템을위한3차원매핑소프트웨어개발.pdf, 
> 128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG, 
> textstripper_2.0.17_128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG, 
> textstripper_2.0.17_独立財政機関をめぐる論点整理_3p_top.PNG, 
> textstripper_2.0.18_128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG, 
> textstripper_2.0.18_独立財政機関をめぐる論点整理_3p_top.PNG, 独立財政機関をめぐる論点整理.pdf, 
> 独立財政機関をめぐる論点整理_3p_top.PNG
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
