[ 
https://issues.apache.org/jira/browse/PDFBOX-192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-192:
--------------------------------------

    Attachment: PDFBOX191-01_.pdf
                PDFBOX191-01_.txt
                PDFBOX191-01_1.png

> Find encodings in FontFile3 - CompactFont Format
> ------------------------------------------------
>
>                 Key: PDFBOX-192
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-192
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Text extraction
>            Priority: Minor
>         Attachments: PDFBOX191-01_.pdf, PDFBOX191-01_.txt, PDFBOX191-01_1.png
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=1545266
> Originally submitted by zlaya_buka on 2006-08-23 06:04.
> Problem finding the encoding
> Debugging a page from the set (uploaded: 01_.pdf)
> showed:
>
> - all the fonts are of the same subtype: Type 1
> - there are no CMaps for any of the fonts
> - the encoding dictionaries for all fonts are practically
> useless: each font's encoding entry contains only a
> Differences array (with just one mapping, for a
> code that does not seem to be used on the page)
>  
> I discovered from the source that in such a case
> PDFBox tries to read the encoding info directly from the font:
>
> COSStream fontFile = (COSStream) fontDescriptor.getDictionaryObject(COSName.FONT_FILE);
> if( fontFile != null )
> {
>     BufferedReader in = new BufferedReader(
>         new InputStreamReader(fontFile.getUnfilteredStream()));
>     /**
>      * This section parses the font program stream, searching
>      * for an /Encoding entry. The search stops when the entry
>      * "currentdict end" is reached or after 100 lines.
>      */
>     ...
> }
>  
> The problem is that all the fonts on my page are
> marked in their font descriptors as FontFile3, i.e. they are
> in Compact Font Format. It seems from the above code
> that PDFBox parses only COSName.FONT_FILE and ignores
> COSName.FONT_FILE3. So in the end I get StandardEncoding
> for all the characters, which is wrong since all
> the pages are in Russian.
>
> Is there any chance of finding a way to
> extract the encoding from a compact font? It seems that
> in my case it's the only place where this info can be
> found, since Acrobat displays all the files correctly
> (TextStripper returns mostly spaces and garbage).
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552835&aid=1545266&file_id=190290
> 01.zip (application/x-zip-compressed), 196513 bytes
> Sample file (first page of a newspaper, in Russian) + font report
> (PDFLib font reporter)
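
For readers unfamiliar with the scan that the quoted source comment describes, here is a minimal sketch of the idea. It is not PDFBox's actual implementation: the class name `Type1EncodingScan`, the method `scan`, and the `MAX_LINES` constant are hypothetical, and the sketch assumes the cleartext part of a Type 1 font program is already available as a string. It collects `dup <code> /<glyphname> put` entries that follow `/Encoding`, and stops at `currentdict end` or after 100 lines.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only (not part of PDFBox).
class Type1EncodingScan {
    // Line limit mentioned in the quoted source comment.
    static final int MAX_LINES = 100;

    // Returns a code -> glyph name map built from "dup <code> /<name> put" lines.
    static Map<Integer, String> scan(String fontProgramText) {
        Map<Integer, String> codeToName = new LinkedHashMap<>();
        boolean inEncoding = false;
        int count = 0;
        for (String line : fontProgramText.split("\n")) {
            if (count++ >= MAX_LINES) {
                break; // give up after 100 lines
            }
            String trimmed = line.trim();
            if (trimmed.startsWith("currentdict end")) {
                break; // end of the cleartext font dictionary
            }
            if (trimmed.startsWith("/Encoding")) {
                // "/Encoding StandardEncoding def" has no entries to collect
                inEncoding = !trimmed.endsWith("def");
                continue;
            }
            if (!inEncoding) {
                continue;
            }
            String[] tok = trimmed.split("\\s+");
            if (tok.length >= 4 && tok[0].equals("dup")
                    && tok[2].startsWith("/") && tok[3].equals("put")) {
                try {
                    codeToName.put(Integer.parseInt(tok[1]), tok[2].substring(1));
                } catch (NumberFormatException e) {
                    // not a numeric character code; skip this line
                }
            } else if (trimmed.endsWith("def")) {
                inEncoding = false; // e.g. "readonly def" closes the array
            }
        }
        return codeToName;
    }
}
```

Calling `Type1EncodingScan.scan(text)` on a font program whose encoding array contains `dup 192 /afii10017 put` yields a map entry from 192 to `afii10017`. A text scan like this only applies to FontFile streams. A FontFile3 stream holds binary CFF data, so it would need a separate parser, which is what this issue asks for.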

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
