[jira] Commented: (PDFBOX-620) Text extract fails on some PDF files but not others...

Villu Ruusmann (JIRA) Sun, 14 Feb 2010 23:49:52 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833708#action_12833708
 ]


Villu Ruusmann commented on PDFBOX-620:
---------------------------------------

You are correct that this is a font encoding issue. All the fonts in file 
"pdf620-works.pdf" do have explicit encodings set (open the file in Acrobat 
Reader and check "File" -> "Document Properties..." -> "Fonts"), whereas the 
ones in file "pdf620-fails.pdf" do not.
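
A minimal sketch of how a character code with no encoding entry can end up as 
the literal text "null", as in the 0.7.3 output quoted below. The class 
EncodingSketch, its decode method and the code values are hypothetical and are 
not part of PDFBox's API; the actual 0.7.3 code path may differ. It only shows 
the general Java behaviour: a failed table lookup returns null, and appending 
a null String writes the four letters "null".

```java
import java.util.HashMap;
import java.util.Map;

class EncodingSketch {
    // Hypothetical stand-in for a font's code-to-text table (not PDFBox's API).
    static String decode(int[] codes, Map<Integer, String> encoding) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            // Map.get returns null for an unmapped code, and
            // StringBuilder.append((String) null) appends the text "null".
            out.append(encoding.get(code));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> encoding = new HashMap<>();
        encoding.put(111, "o");
        encoding.put(105, "i");
        encoding.put(116, "t");
        encoding.put(115, "s");
        // Code 2 stands for a glyph that the table has no entry for.
        int[] codes = {111, 2, 105, 116, 115};
        System.out.println(decode(codes, encoding)); // prints "onullits"
    }
}
```

With an explicit encoding every code has an entry and no "null" appears; 
without one, the lookup has to fall back on the encoding built into the font 
itself.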

The good news is that PDFBox's Type1C font support has been improved recently. 
If you try out the latest PDFBox 1.0.1-SNAPSHOT (you might need to apply 
PDFBOX-619 to SVN trunk if it is not there yet), this issue should be gone.

Below are my text extraction results:
Dermoapo made 'interactive updates' a key part of its strategy for laun-
ching a new skincare range in a competitive market. The result? Increased 
sales for pharmacies that used the updates.

> Text extract fails on some PDF files but not others...
> ------------------------------------------------------
>
>                 Key: PDFBOX-620
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-620
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.7.3, 0.8.0-incubator
>         Environment: Tried in Java 5 and 6
>            Reporter: Nicholas Cottrell
>         Attachments: pdf620-fails.pdf, pdf620-works.pdf
>
>
> Having the same problem with 0.7.3, 0.7.4-dev and 0.8.0 - in 0.7.3 I get text 
> with nulls, e.g. "Dermoapo made 'interactive updates' a key part onullits 
> stratenull nullr launnull chinnulla new skincare rannull in a competitive 
> market. nulle resultnullIncreased sales nullr pharmacies that used the 
> updates." while in 0.8.0 it appears as "Dermoapo made 'interactive updates' a 
> key part o?its strate? ?r laun?
> chin?a new skincare ran? in a competitive market. ?e result?Increased 
> sales ?r pharmacies that used the updates." 
> Maybe this is a font problem? Or encoding? I debugged the code in 
> PDFTextStripper and these appear in the charactersByArticle field even 
> before normalization. 
> In 0.8.0 I get some info logs from the engine:
> SP INFO  20:52:12 PDFStreamEngine - unsupported/disabled operation: re
> SP INFO  20:52:12 PDFStreamEngine - unsupported/disabled operation: W
> SP INFO  20:52:12 PDFStreamEngine - unsupported/disabled operation: n
> SP INFO  20:52:12 PDFStreamEngine - unsupported/disabled operation: cs
> SP INFO  20:52:12 PDFStreamEngine - unsupported/disabled operation: scn
> SP INFO  20:52:12 PDFStreamEngine - unsupported/disabled operation: f
> SP INFO  20:52:12 PDFStreamEngine - unsupported/disabled operation: CS
> SP INFO  20:52:12 PDFStreamEngine - unsupported/disabled operation: SCN
> SP INFO  20:52:12 PDFStreamEngine - unsupported/disabled operation: M
> SP INFO  20:52:12 PDFStreamEngine - unsupported/disabled operation: m
> SP INFO  20:52:12 PDFStreamEngine - unsupported/disabled operation: l
> SP INFO  20:52:12 PDFStreamEngine - unsupported/disabled operation: S
> SP INFO  20:52:12 PDFStreamEngine - unsupported/disabled operation: BDC
> SP INFO  20:52:12 PDFStreamEngine - unsupported/disabled operation: c
> SP INFO  20:52:12 PDFStreamEngine - unsupported/disabled operation: v
> SP INFO  20:52:12 PDFStreamEngine - unsupported/disabled operation: y
> SP INFO  20:52:12 PDFStreamEngine - unsupported/disabled operation: h
> SP INFO  20:52:12 PDFStreamEngine - unsupported/disabled operation: g
> SP INFO  20:52:12 PDFStreamEngine - unsupported/disabled operation: G
> SP INFO  20:52:12 PDFStreamEngine - unsupported/disabled operation: EMC
> I got the same error with icu4j 3.6.1 and 4.2.1

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
