[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

Tilman Hausherr (Jira) Wed, 27 Jan 2021 21:47:13 -0800


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273334#comment-17273334
 ]


Tilman Hausherr commented on PDFBOX-5090:
-----------------------------------------

As a first step, the code 49077 is mapped to CID 2516 (0x09D4). In the next 
step, this value is looked up in the Adobe-Korea1-UCS2 table but isn't found. 
The relevant entry is
{noformat}
<09CF> <09D4> <C5FC>
{noformat}
and part of this range (<09CF> to <09D4>) is now ignored. Before the two 
commits, all 6 elements of the range were used; after the commits, only 4 are 
used, so that the last result is C5FF. See the comment by [~mkl] in the 
related issue.
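
To make the arithmetic above concrete, here is a minimal sketch of how the 
entry <09CF> <09D4> <C5FC> expands in the two cases. This is not the actual 
PDFBox code: the class name BfRangeSketch, the methods expand and describe, 
and the clampLowByte flag are hypothetical, and the stop condition (end the 
expansion once the low byte of the target reaches 0xFF) is an assumption 
inferred from the fact that the last result is C5FF.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BfRangeSketch {

    /**
     * Expands a range entry <startCid> <endCid> <firstValue> into a map
     * from CID to UTF-16 code unit. With clampLowByte set, the expansion
     * stops once the low byte of the target value reaches 0xFF (assumed
     * to model the behaviour after the two commits).
     */
    static Map<Integer, Integer> expand(int startCid, int endCid,
                                        int firstValue, boolean clampLowByte) {
        Map<Integer, Integer> map = new LinkedHashMap<>();
        int value = firstValue;
        for (int cid = startCid; cid <= endCid; cid++) {
            map.put(cid, value);
            if (clampLowByte && (value & 0xFF) == 0xFF) {
                break; // remaining CIDs of the range are ignored
            }
            value++;
        }
        return map;
    }

    /** Formats the lookup result for a CID, or reports that it is missing. */
    static String describe(Map<Integer, Integer> map, int cid) {
        Integer value = map.get(cid);
        return value == null ? "not found" : String.format("U+%04X", value);
    }

    public static void main(String[] args) {
        int cid = 0x09D4; // 2516, the CID that code 49077 maps to

        Map<Integer, Integer> relaxed = expand(0x09CF, 0x09D4, 0xC5FC, false);
        Map<Integer, Integer> strict = expand(0x09CF, 0x09D4, 0xC5FC, true);

        System.out.println(relaxed.size() + " entries, CID 2516 -> "
                + describe(relaxed, cid)); // 6 entries, CID 2516 -> U+C601
        System.out.println(strict.size() + " entries, CID 2516 -> "
                + describe(strict, cid));  // 4 entries, CID 2516 -> not found
    }
}
```

With the clamped expansion, only CIDs 0x09CF to 0x09D2 receive a value, so 
the lookup for CID 2516 comes back empty, which would match the missing text.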

I'm wondering whether Adobe uses a more "relaxed" approach for built-in tables?

I also remember something about ranges being 2-dimensional sometimes?!

> Missing text extraction under certain conditions starting with apache pdfbox 
> 2.0.18
> -----------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5090
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5090
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.18, 2.0.19, 2.0.20, 2.0.21, 2.0.22
>         Environment: jdk 1.8, apache pdfbox, fontbox 2.0.18~, windows 10
>            Reporter: sungwon kim
>            Priority: Major
>              Labels: regression
>         Attachments: 128채널심장전기도시스템을위한3차원매핑소프트웨어개발.pdf, 
> 128채널심장전기도시스템을위한3차원매핑소프트웨어개발.txt, 
> 128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG, PDFBOX-5090_reduced.pdf, 
> textstripper_2.0.17_128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG, 
> textstripper_2.0.17_独立財政機関をめぐる論点整理_3p_top.PNG, 
> textstripper_2.0.18_128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG, 
> textstripper_2.0.18_独立財政機関をめぐる論点整理_3p_top.PNG, 独立財政機関をめぐる論点整理.pdf, 
> 独立財政機関をめぐる論点整理_3p_top.PNG
>
>
> When calling the PDFTextStripper.getText() function on pdfbox 2.0.18 or 
> later, it fails to extract some text under certain conditions.
> It is suspected that the missing text is associated with the font type, the 
> font size, or the text's width and height.
> I have attached the text extraction results of version 2.0.17 and version 
> 2.0.18 and the sample data used for the test.
> Code:
>  
> {code:java}
> try (PDDocument pdDocument = PDDocument.load(new File(path))) {
>     PDFTextStripper stripper = new PDFTextStripper();
>     String text = stripper.getText(pdDocument); // the call that loses text
> }
> {code}
> Dependencies:
>  
> {code:xml}
> <properties>
>     <apache.pdfbox.version>2.0.18</apache.pdfbox.version>
> </properties>
> <dependencies>
>     <dependency>
>         <groupId>org.apache.pdfbox</groupId>
>         <artifactId>pdfbox</artifactId>
>         <version>${apache.pdfbox.version}</version>
>     </dependency>
>     <dependency>
>         <groupId>org.apache.pdfbox</groupId>
>         <artifactId>fontbox</artifactId>
>         <version>${apache.pdfbox.version}</version>
>     </dependency>
>     <dependency>
>         <groupId>org.apache.pdfbox</groupId>
>         <artifactId>xmpbox</artifactId>
>         <version>${apache.pdfbox.version}</version>
>     </dependency>
> </dependencies>
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
