Character encoding in a PDF

Kirk Brooks Mon, 21 Nov 2016 15:59:08 -0800

Hi all,
I've been working with the material Neil Dennis presented at the Summit for
building PDFs directly in code. In general I'm really pleased with the
results I'm getting. Neil's demo database provides the basic tools you need
to be able to build a PDF document. I used them as a starting point for a
couple of reports I need to be able to produce directly into PDFs. It is a
little counter-intuitive in many ways but after dinking with it for a few
hours I was able to start building things I could actually use. It turns
out to be much like building reports using Print form, but with fewer of the
limitations. Most notably, unlike Print form, you can traverse the document
in almost any order. So you can determine the bottom of a page (by 'printing'
down to it), move on to the next page or two, and come back to the first
page to add something, like the ending page. Pretty cool. And the
documents are created as BLOBs, so they can be stored, attached to emails as
is, or written to disk.


This is good for me because it resolves various issues I had regarding
producing these documents using the standard printer-translation approach.

The "but" comes in the form of some character encoding issues. This is not
an area of strength for me, so this may be pitifully simple, but some
characters don't encode correctly in the PDF text even though 4D is happy
to deal with them in text fields. Things like 'smart quotes' and em dashes
are the first things to pop up.

The first thing I did was take a look at what the characters are by using
Character code. Here are some examples of what I found:

¿ :  0x00BF  questiondown
" :  0x0022  quotedbl
„ :  0x201E  quotedblbase
“ :  0x201C  quotedblleft
” :  0x201D  quotedblright
‘ :  0x2018  quoteleft
’ :  0x2019  quoteright
‚ :  0x201A  quotesinglbase
' :  0x0027  quotesingle

The numbers are the UTF-16 char codes returned by Character code (in hex).
OK. What can I do with that?
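
As a cross-check outside 4D, the same listing can be reproduced in a few
lines of Python (used here only for illustration; SAMPLES and code_points
are hypothetical names). Python's ord returns the Unicode code point, which
for these characters equals the UTF-16 code unit that Character code
reports.

```python
# Sample characters from the list above, written as escapes so the
# source file's own encoding cannot alter them.
SAMPLES = "\u00bf\"\u201e\u201c\u201d\u2018\u2019\u201a'"

def code_points(text):
    """Return (character, 4-digit hex code point) pairs for each character."""
    return [(ch, f"0x{ord(ch):04X}") for ch in text]

for ch, code in code_points(SAMPLES):
    print(ch, ":", code)
```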

The easy kludge is to simply replace them with ASCII equivalents (i.e., stop
the bleeding). But that's an unsatisfying solution. Plus it seems kinda
sloppy, since I copied those code samples from the PDF Reference
<http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf>
(p. 999) and these very characters are referenced on p. 1005. I'm
wondering if something happens in the blob manipulation during
construction. The PDF is built in blobs and there are some conversions of
blob to text and text to blob along the way. Or this may be an issue of not
having complete font information available. Are there any PDF-savvy readers
who can offer some insight here?
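
One way to test the blob-conversion theory is to look at the bytes each
encoding produces. The sketch below is Python, not 4D, and rests on
assumptions that are not confirmed here: that the PDF font uses
WinAnsiEncoding (which, for the characters listed above, matches Windows
code page 1252), and that the text-to-blob step may be emitting UTF-8
instead. The function names are hypothetical. It also includes the ASCII
replacement kludge for comparison.

```python
def pdf_literal_winansi(text):
    """Encode text as a PDF literal string, assuming a WinAnsiEncoding font.

    Assumption: cp1252 stands in for WinAnsiEncoding; the two agree for the
    quote and dash characters discussed here. Characters outside cp1252
    raise UnicodeEncodeError instead of being silently mangled.
    """
    raw = text.encode("cp1252")
    out = bytearray(b"(")
    for b in raw:
        if b in b"\\()":
            out += b"\\" + bytes([b])      # escape PDF string delimiters
        elif b < 0x20 or b > 0x7E:
            out += b"\\%03o" % b           # octal escape keeps the stream 7-bit
        else:
            out.append(b)
    out += b")"
    return bytes(out)

def ascii_fallback(text):
    """The 'stop the bleeding' kludge: swap typographic marks for ASCII."""
    table = str.maketrans({
        "\u201c": '"', "\u201d": '"', "\u201e": '"',
        "\u2018": "'", "\u2019": "'", "\u201a": "'",
        "\u2014": "--",
    })
    return text.translate(table)

sample = "\u201cHi\u201d \u2014 ok"
print(sample.encode("utf-8"))        # three bytes per typographic character
print(pdf_literal_winansi(sample))   # one byte each, written as octal escapes
print(ascii_fallback(sample))
```

If the bytes that end up in the PDF content stream look like the UTF-8
output rather than the single-byte form, the conversion step would be the
first place to look. If they already match the single-byte form, the font's
encoding entry is the more likely suspect.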

Thanks

-- 
Kirk Brooks
San Francisco, CA
=======================
**********************************************************************
4D Internet Users Group (4D iNUG)
**********************************************************************
