Oleksii Zinkovskyi created PDFBOX-4036:
------------------------------------------

             Summary: Invalid ToUnicode CMap in font
                 Key: PDFBOX-4036
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4036
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.8, 2.0.4
         Environment: Windows 10 64 bit, STS 3.9.1, JDK 1.8.0_152, Gradle
            Reporter: Oleksii Zinkovskyi
         Attachments: CSTA17.pdf

While calling textStripper.getText(document) on the attached PDF file to 
extract text and save it to .txt, I receive the following warnings:

{quote}Dec 15, 2017 8:53:22 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font UYQXWX+MaterialIcons-Regular
Dec 15, 2017 8:53:22 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+380 (380) in font 
UYQXWX+MaterialIcons-Regular
Dec 15, 2017 8:53:22 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+381 (381) in font 
UYQXWX+MaterialIcons-Regular
Dec 15, 2017 8:53:25 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font FANHRS+MaterialIcons-Regular
Dec 15, 2017 8:53:25 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+380 (380) in font 
FANHRS+MaterialIcons-Regular
Dec 15, 2017 8:53:25 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+381 (381) in font 
FANHRS+MaterialIcons-Regular{quote}

In the end the file is generated and saved, but some letters are missing 
(for example "ft" in "software" or "ff" in "different"). So far I've tested 
close to 10 files, and this is the only problematic one I've found. Depending 
on which program I use to view the resulting .txt file, I get either blank 
spaces (Notepad) or "NUL" values (Notepad++) in place of the missing letters. 
What's more, some editors (Sublime Text) refuse to open the file at all and 
treat it as unreadable or corrupted binary data. Suffice it to say, working 
with such a file is difficult.
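
As a possible stopgap until the mapping itself is fixed, the extracted string 
could be scrubbed before it is written to disk. The sketch below assumes the 
unmapped glyphs really do arrive as U+0000 characters in the string returned 
by getText (which is what the "NUL" display in Notepad++ suggests, but is not 
confirmed here). The class and method names are hypothetical and not part of 
PDFBox; the sample string only imitates the symptom described above.

```java
// Hypothetical helper, not part of PDFBox: counts and replaces NUL
// characters in already-extracted text. It does not recover the
// missing letters; it only makes the output openable in text editors.
public class NulScrubber {

    // U+FFFD REPLACEMENT CHARACTER marks each position where a glyph was lost.
    public static final char PLACEHOLDER = '\uFFFD';

    // Returns how many U+0000 characters the text contains.
    public static int countNul(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\0') {
                count++;
            }
        }
        return count;
    }

    // Returns a copy of the text with every U+0000 replaced by PLACEHOLDER.
    public static String replaceNul(String text) {
        return text.replace('\0', PLACEHOLDER);
    }

    public static void main(String[] args) {
        // Assumed shape of the broken output: one NUL where "ft" and "ff" were.
        String extracted = "so" + '\0' + "ware is di" + '\0' + "erent";
        System.out.println("NUL characters found: " + countNul(extracted));
        System.out.println(replaceNul(extracted));
    }
}
```

Replacing rather than deleting keeps the positions of the lost letters 
visible, so affected words can still be found and corrected by hand.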



