[jira] [Commented] (PDFBOX-3255) Reasonable way to handle missing characters in font

Christian Brandt (JIRA) Wed, 02 Mar 2016 09:34:18 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176044#comment-15176044
 ]


Christian Brandt commented on PDFBOX-3255:
------------------------------------------

Hi!

I ended up with the following routine:

{code}
private void setStringValue(PDField field, String input) throws Exception
{
        /* Extract the font name from the default appearance (DA) string */
        String  da      = field.getCOSObject().getString(COSName.DA.getName());
        if (da == null)
        {
                /* No DA entry on the field itself: nothing to check against */
                field.setValue(input);
                return;
        }

        Matcher m       = Pattern.compile("/?(.*) [\\d]+ Tf.*", Pattern.CASE_INSENSITIVE).matcher(da);
        String  name    = m.find() ? m.group(1) : null;
        PDFont  font    = name == null ? null
                        : field.getAcroForm().getDefaultResources().getFont(COSName.getPDFName(name));

        if (font instanceof PDSimpleFont)
        {
                /* Walk through the characters and replace those that cannot be
                   represented by the font with a space */
                StringBuilder value = new StringBuilder();

                Encoding encoding = ((PDSimpleFont) font).getEncoding();

                for (int i = 0; i < input.length(); i++)
                {
                        char c = input.charAt(i);

                        if (!".notdef".equals(encoding.getName(c)))
                                value.append(c);
                        else
                                value.append(' ');
                }

                field.setValue(value.toString());
        }
        else
                field.setValue(input);
}
{code}
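
As a hedged alternative to the regex in the routine above, the font name can
be located by splitting the DA string on whitespace and taking the operand two
tokens before the Tf operator. This is only a sketch in plain Java without
PDFBox: the class and method names (DaFontName, fontNameFromDa) are made up
for illustration, and it assumes that operands are separated by whitespace and
that the font name contains no #xx escapes.

```java
import java.util.Arrays;
import java.util.List;

/* Hypothetical helper, not part of PDFBox. */
final class DaFontName
{
        /* Returns the font resource name used by the last Tf operator in the
           DA string (without the leading slash), or null if none is found. */
        static String fontNameFromDa(String da)
        {
                if (da == null)
                        return null;

                /* Assumption: operands and operators are whitespace-separated */
                List<String> tokens = Arrays.asList(da.trim().split("\\s+"));

                /* Tf takes two operands: "/Name size Tf" */
                int tf = tokens.lastIndexOf("Tf");
                if (tf < 2)
                        return null;

                String operand = tokens.get(tf - 2);
                if (operand.length() > 1 && operand.startsWith("/"))
                        return operand.substring(1);

                return null;
        }
}
```

With this sketch, fontNameFromDa("0 g /Helv 9.5 Tf") returns "Helv", that is, a
DA string with a leading colour operator and a fractional font size. A pattern
that starts matching at the beginning of the string and only accepts integer
sizes does not handle that input.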

Despite the obvious performance issues, this seems to work, at least with the 
test cases I tried. However:

1. It would be nice to use 
PDVariableText.getDefaultAppearanceString().getFont() to get the associated 
font instead of parsing the name manually and then fetching it from the 
resources, but the method is not accessible. As it stands, I am not sure 
whether my regex covers all possible cases.
2. Because Encoding.contains('\u00AD') may return true (the value ".notdef" 
seems to be stored for that code), a string comparison against ".notdef" is 
required, which is not nice. The caller can of course optimize this a bit by 
caching the result for recurring characters, but it would make life easier if 
the string comparison were not needed at all.

> Reasonable way to handle missing characters in font
> ---------------------------------------------------
>
>                 Key: PDFBOX-3255
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3255
>             Project: PDFBox
>          Issue Type: Wish
>          Components: AcroForm
>    Affects Versions: 2.0.0
>            Reporter: Christian Brandt
>              Labels: newbie
>         Attachments: TEST.pdf
>
>
> Hello,
> We have an issue with setting form field values if the input contains 
> characters that cannot be rendered with the associated font. The system 
> throws an exception similar to:
> java.lang.IllegalArgumentException: U+0308 ('dieresiscmb') is not available 
> in this font's encoding: MacRomanEncoding with differences
> Currently this is difficult to handle outside the framework because, 
> based on my understanding (please correct me if I'm wrong), the caller does 
> not have a way to figure out which font will eventually be used and therefore 
> which characters are not renderable.
> What we would ultimately like is for the library to optionally replace 
> unrenderable characters with some other existing character (e.g. space) 
> instead of failing the call, or to provide a way to recover from this error 
> so that the user can call the method again with altered input.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
