[ https://issues.apache.org/jira/browse/PDFBOX-192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler updated PDFBOX-192:
--------------------------------------
Attachment: PDFBOX191-01_.pdf
PDFBOX191-01_.txt
PDFBOX191-01_1.png
> Find encodings in FontFile3 - CompactFont Format
> ------------------------------------------------
>
> Key: PDFBOX-192
> URL: https://issues.apache.org/jira/browse/PDFBOX-192
> Project: PDFBox
> Issue Type: New Feature
> Components: Text extraction
> Priority: Minor
> Attachments: PDFBOX191-01_.pdf, PDFBOX191-01_.txt, PDFBOX191-01_1.png
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=1545266
> Originally submitted by zlaya_buka on 2006-08-23 06:04.
> Problem finding the encoding
> Debugging a page from the set (uploaded: 01_.pdf)
> showed:
>
> - all the fonts are of the same subtype, Type 1
> - there are no CMaps for any of the fonts
> - the encoding dictionaries for all fonts are practically
> useless: each font's encoding entry contains
> only a Differences array (with just one mapping, for a
> code that does not seem to be used on the page)
>
> I found in the source that in such a case
> PDFBox tries to read the encoding info directly from the font:
>
> COSStream fontFile = (COSStream) fontDescriptor.getDictionaryObject(COSName.FONT_FILE);
> if( fontFile != null )
> {
>     BufferedReader in = new BufferedReader(
>         new InputStreamReader(fontFile.getUnfilteredStream()));
>     /*
>      * This section parses the font program stream, searching
>      * for an /Encoding entry.
>      * The search stops when the entry "currentdict end" is
>      * reached, or after 100 lines.
>      */
>     ...
> }
>
> The problem is that all the fonts on my page have
> FontFile3 entries in their font descriptors, i.e. they are
> in Compact Font Format (CFF). From the above code it seems
> that PDFBox parses only COSName.FONT_FILE and ignores
> COSName.FONT_FILE3. So in the end I get StandardEncoding
> for all the characters, which is wrong, since all
> the pages are in Russian.
>
> Is there any way to extract the encoding from a
> Compact Font Format font program? In my case it seems
> to be the only place where this information can be
> found, since Acrobat displays all the files correctly
> (whereas TextStripper returns mostly spaces and garbage).
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552835&aid=1545266&file_id=190290
> 01.zip (application/x-zip-compressed), 196513 bytes
> Sample file (first page of a newspaper, in Russian) plus a font report
> (from the PDFLib font reporter)
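As a rough illustration of the gap described above, here is a minimal sketch of a lookup that falls back from FontFile to FontFile3 and then checks whether the embedded program is textual Type 1 or binary CFF. It is not PDFBox code: the class FontProgramProbe, the Kind enum, and the methods findFontFileKey and classify are hypothetical names, and the font descriptor is modelled as a plain Map instead of a COSDictionary. The only format facts it relies on are that a Type 1 font program starts with the PostScript comment "%!" and that a CFF table starts with a four-byte header (major, minor, hdrSize, offSize). The distinction matters because a line-based search for /Encoding can only work on the textual form.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: the descriptor is a plain Map here.
final class FontProgramProbe {

    enum Kind { TYPE1_TEXT, CFF_BINARY, UNKNOWN }

    // Descriptor keys to try, in order; the quoted code only looks at the first.
    static final String[] FONT_FILE_KEYS = { "FontFile", "FontFile3" };

    /** Returns the first key that holds an embedded font program, or null. */
    static String findFontFileKey(Map<String, byte[]> descriptor) {
        for (String key : FONT_FILE_KEYS) {
            if (descriptor.get(key) != null) {
                return key;
            }
        }
        return null;
    }

    /** Guesses the font program format from its first four bytes. */
    static Kind classify(byte[] data) {
        if (data == null || data.length < 4) {
            return Kind.UNKNOWN;
        }
        // A Type 1 program is PostScript text and starts with "%!".
        if (data[0] == '%' && data[1] == '!') {
            return Kind.TYPE1_TEXT;
        }
        // CFF header: major version, minor version, header size, offset size.
        int major = data[0] & 0xFF;
        int hdrSize = data[2] & 0xFF;
        int offSize = data[3] & 0xFF;
        if (major == 1 && hdrSize >= 4 && offSize >= 1 && offSize <= 4) {
            return Kind.CFF_BINARY;
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        // Made-up descriptor with only a FontFile3 entry, as on the reported page.
        Map<String, byte[]> descriptor = new LinkedHashMap<>();
        descriptor.put("FontFile3", new byte[] { 1, 0, 4, 1, 0, 1 });
        String key = findFontFileKey(descriptor);
        System.out.println(key + " -> " + classify(descriptor.get(key)));
    }
}
```

A result of CFF_BINARY would mean the text-oriented /Encoding search cannot be reused as is, and the encoding would have to be read from the CFF data structures instead.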