[jira] [Comment Edited] (PDFBOX-4793) Questionable fallback font for some embedded chinese fonts

Christian Appl (Jira) Fri, 06 Mar 2020 02:30:25 -0800


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053251#comment-17053251
 ]


Christian Appl edited comment on PDFBOX-4793 at 3/6/20, 10:29 AM:
------------------------------------------------------------------

Sorry for reopening.

Hmmm - I'm trying to understand the math here... Maybe I am completely off and 
simply don't understand how this works, but something seems odd to me.

 !screenshot-9.png! 

Those are the values I am reading from the font Malgun Gothic. According to 
the link you shared 
(https://docs.microsoft.com/en-us/typography/opentype/spec/os2#cpr), 
ulCodePageRange1 (bits 0–31) and ulCodePageRange2 (bits 32–63) are relevant for 
determining which languages are supported by an MS font.

According to fontDrop those values are:
ulCodePageRange1 : 524289 (1000 0000 0000 0000 0001)
ulCodePageRange2 : 0
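
As a cross-check of the binary representation above, here is a minimal sketch that lists which bits are set in 524289. The class and method names (CodePageBits, setBits) are made up for illustration and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePageBits {
    // Returns the indices of all bits that are set in a 64-bit code page range value.
    static List<Integer> setBits(long codePageRange) {
        List<Integer> bits = new ArrayList<>();
        for (int bit = 0; bit < 64; bit++) {
            // 1L keeps the shift in 64-bit arithmetic, which matters for bits >= 31
            if ((codePageRange & (1L << bit)) != 0) {
                bits.add(bit);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        // Value reported by fontDrop for Malgun Gothic
        long ulCodePageRange1 = 524289L;
        System.out.println(Long.toBinaryString(ulCodePageRange1));
        System.out.println(setBits(ulCodePageRange1));
    }
}
```

This prints 10000000000000000001 and [0, 19]. According to the OS/2 specification linked above, bit 0 is Latin 1 (code page 1252) and bit 19 is Korean Wansung (code page 949), so none of the bits 17, 18, 20 or 21 is set in this value.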

Using parts of your code:
{code:java}
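public class CodePageCheck {
    public static void main(String... args) {
        // ulCodePageRange1 of Malgun Gothic as reported by fontDrop
        long ulCodePageRange1 = 524289;

        // Masks for the CJK code page bits (17-21) of the OS/2 table;
        // 1L keeps the shift in long arithmetic
        long JIS_JAPAN = 1L << 17;
        long CHINESE_SIMPLIFIED = 1L << 18;
        long KOREAN_WANSUNG = 1L << 19;
        long CHINESE_TRADITIONAL = 1L << 20;
        long KOREAN_JOHAB = 1L << 21;

        // Only bit 19 (Korean Wansung) is set in 524289
        System.out.println((ulCodePageRange1 & KOREAN_WANSUNG) == KOREAN_WANSUNG);         // true
        System.out.println((ulCodePageRange1 & KOREAN_JOHAB) == KOREAN_JOHAB);             // false
        System.out.println((ulCodePageRange1 & CHINESE_SIMPLIFIED) == CHINESE_SIMPLIFIED); // false
    }
}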
{code}

Resulting in the output:
 !screenshot-8.png! 

Which evaluates to exactly what I would expect: it is indeed a Korean and not 
a Chinese font.
I currently don't know how the variable "long codePageRange" is determined 
exactly, and I don't know if I am thinking too simply here... but is this 
statement:
bq. // PDFBOX-4793 and PDF.js 10699: This font has only Korean, but has bits 17-21 set.
really true?

As far as I can see this evaluation should work fine. Is the correct value used 
for "codePageRange"?
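
To make the question concrete: I have not verified how PDFBox actually derives "codePageRange". The following is only a sketch of how I would expect the two 32-bit OS/2 fields to be packed into a single long. The class and helper names (CodePageRangeCombiner, combine, hasBit) are hypothetical and not taken from PDFBox.

```java
public class CodePageRangeCombiner {
    // Hypothetical helper: packs ulCodePageRange1 (bits 0-31) and
    // ulCodePageRange2 (bits 32-63) into one 64-bit value.
    // Masking with 0xFFFFFFFFL guards against sign extension if a field
    // was read as a negative 32-bit int.
    static long combine(long range1, long range2) {
        return (range1 & 0xFFFFFFFFL) | ((range2 & 0xFFFFFFFFL) << 32);
    }

    // True if the given bit (0-63) is set in the combined value.
    static boolean hasBit(long codePageRange, int bit) {
        return (codePageRange & (1L << bit)) != 0;
    }

    public static void main(String[] args) {
        // Values reported by fontDrop for Malgun Gothic
        long codePageRange = combine(524289L, 0L);
        for (int bit = 17; bit <= 21; bit++) {
            System.out.println("bit " + bit + ": " + hasBit(codePageRange, bit));
        }
    }
}
```

With the fontDrop values, only bit 19 is reported as set in the range 17-21. If the value PDFBox sees has all of those bits set, then it would differ from what fontDrop reads from the file.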

Sorry for being pushy :)


> Questionable fallback font for some embedded chinese fonts
> ----------------------------------------------------------
>
>                 Key: PDFBOX-4793
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4793
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Rendering
>    Affects Versions: 2.0.18, 2.0.19
>            Reporter: Christian Appl
>            Assignee: Tilman Hausherr
>            Priority: Major
>             Fix For: 2.0.20, 3.0.0 PDFBox
>
>         Attachments: PDFJS-10699.pdf, image-2020-03-04-09-49-42-323.png, 
> image-2020-03-04-09-58-01-055.png, image-2020-03-04-10-09-25-343.png, 
> image-2020-03-04-10-31-03-065.png, pdf_font-zhcn.pdf, screenshot-2.png, 
> screenshot-3.png, screenshot-4.png, screenshot-5.png, screenshot-6.png, 
> screenshot-7.png, screenshot-8.png, screenshot-9.png
>
>
> *Issue:*
> I tried to render PDFs that contain embedded Chinese fonts. Neither the PDF 
> Debugger, nor printouts of the document (PDFPrintable), nor the PDFRenderer 
> can display/render the Chinese glyphs correctly; all of them render 
> placeholders instead.
> *Assumptions:*
> I assume that said embedded fonts are incomplete and don't contain all 
> glyphs that would be required to render the text properly, and that PDFBox 
> therefore attempts to use the previously determined fallback font. (!?)
>  !image-2020-03-04-09-49-42-323.png! 
>  !image-2020-03-04-09-58-01-055.png! 
> And it fails to find the glyphs in said fallback font.
> This is not surprising, as the fallback font "MalgunGothic-Semilight" 
> (a Windows standard font) does not contain Chinese characters.
>  !image-2020-03-04-10-09-25-343.png! 
> *Debugging:*
> I tried to understand how the fallback font is determined and what could be 
> done to solve this problem on my end, but I was unable to find a satisfying 
> solution.
> My best guess so far is that the CIDFontMapping (FontMapperImpl) is to blame 
> for determining an unfit fallback font.
> Although it seems to check whether the required code pages are contained in a 
> fallback font, it still ranks the Malgun font as the top scorer and best 
> substitute font, even though it clearly does not contain all required 
> code pages.
> *My opinion:*
> This is troubling, as better-fitting fonts exist and could have been 
> selected (e.g. Adobe Song Std). They are indeed included in the 
> CIDFontMapping, but seemingly score lower for some reason.
> *Further information:*
> I cannot disclose the document in question; however, I found a document 
> (pdf_font-zhcn.pdf) in another issue (PDFBOX-3132) that can be used to 
> reproduce the issue (e.g. by dropping it into the PDF Debugger).
>  !image-2020-03-04-10-31-03-065.png! 


