Hi Joe,

On Mon, Sep 28, 2009 at 11:50 PM, Joe Atzberger <ohioc...@gmail.com> wrote:

> The problem is not really with Koha, it is with the PDF format.



Very definitely so.



> I worked on this a while back, and concluded it will not be possible to
> cleanly solve without serious trade-offs:
>
>    - controlling more aspects of the process, like requiring specific
>    fonts on the user's system, or
>
>
This is an inevitable fact of PDF creation in any application.


>
>    - dramatic increase in filesize (orders of magnitude larger),
>
>
Using Flate (zlib/deflate) compression on the PDF streams helps with this
problem. Nearly all PDFs produced by well-known PDF-creating applications
compress their streams.
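For what it's worth, the effect is easy to demonstrate. Here is a quick sketch
(in Python, purely for illustration; the content-stream text is made up) of how
much a typical, highly repetitive PDF content stream shrinks under the same
zlib/deflate compression that PDF's FlateDecode filter uses:

```python
import zlib

# A fake but representative PDF content stream: text-drawing operators
# are highly repetitive, so they compress extremely well.
raw = ("BT /F1 12 Tf 72 720 Td (label text line) Tj ET\n" * 200).encode("ascii")

# PDF's FlateDecode filter is zlib/deflate; level 9 = maximum effort.
compressed = zlib.compress(raw, 9)

print(len(raw), len(compressed))  # the compressed stream is a small fraction of the raw size
```

The same idea applies to an embedded font program: compressing it won't make
embedding free, but it takes a large bite out of the filesize penalty.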


>
>    - embedding the font in the PDF,
>
>
This is the only option for creating truly portable PDFs. Otherwise, as you
say, we have to "require" that the reading user install certain fonts in order
to view the PDF.


>
>    - or producing effectively page-sized images, or
>
>
This is not a realistic option, imho.


>
>    - non-free PDF components, or not supporting common versions of Acrobat
>    Reader, or
>
>
We should attempt to maintain the widest possible compatibility, although it
may not be possible to please all of the readers all of the time.


>
>    - custom character set conversion into ASCII (as much as possible),
>    i.e. data loss.
>
>
Yuk.... ;-)


>
>
> The CPAN modules had really quite poor APIs for dealing with Unicode data.
> Any of the available methods would require heavy overhaul of the code and
> the approach to labels in general.
>


Neither of the main CPAN modules (PDF::Reuse and PDF::API2) offers anywhere
near a full implementation of the ISO 32000-1 standard, and both are lacking in
some basic areas, making the creation of good-quality PDFs nearly impossible in
their current state.



> In my opinion, development time might be better spent on piping the data
> into an external known good UNICODE-capable print tool or something like
> Open Office.



I agree that this should be at least given some consideration. However, if
we go that way, we would simply be "requiring" another piece of software for
the end user rather than "requiring" fonts. (In the final analysis there
will always be a minimum level of required packages in order to run Koha.)



> Generating PDFs out of (FOSS) perl just didn't seem to be a viable answer.
>

Maybe.


>
> I would be interested to see any counter-examples with FOSS perl producing
> compact, cross-platform PDFs with some UTF-8 data like Chinese, or
> Lithuanian... that don't require specific fonts.
>


The problem is a bit more fundamental than FOSS code. In order to properly
display fonts in a PDF, there are only two options: either (1) embed the font
in the PDF stream, or (2) require the reader to have the correct fonts
installed on their system.

If we have issues with requiring the installation of fonts, then we must take
option 1 if we are going to produce our own PDFs. If we opt to pipe to another
app, then we must require the installation of that app on the creator's system,
and we are still bound by the above two options, as they are fundamental to the
PDF standard: the "other app" will itself have to either embed or require.

The entire issue is a gnarly one, to be sure. Even Adobe acknowledges this
fact. Using another app to produce the PDFs *may* be the easy way out. Of
course, we will always be dependent on that app to do our work for us.
However, I do not see how we can get away from having to require Unicode
fonts in order to produce printable documents containing Unicode characters.
The so-called "standard 14 fonts" (the Courier, Times-Roman, and Helvetica
families) only support single-byte encodings covering roughly the Latin-1
range. They do not support Unicode code points such as 8230 (*…*), 8364 (*€*),
etc., and therefore will only work when the Unicode code point and the glyph ID
happen to agree.
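To make the mismatch concrete, here is a small illustration (Python, just to
show the numbers involved). PDF's WinAnsiEncoding is essentially Windows code
page 1252, a single-byte encoding, so the character code actually written into
the content stream is generally not the Unicode code point:

```python
# The byte written into a WinAnsi-encoded content stream selects a glyph
# slot in the font and need not equal the Unicode code point.
for cp in (8230, 8364):  # U+2026 HORIZONTAL ELLIPSIS, U+20AC EURO SIGN
    ch = chr(cp)
    byte = ch.encode("cp1252")[0]
    print(f"U+{cp:04X} ({ch}) -> stream byte 0x{byte:02X}")
    # U+2026 (…) -> stream byte 0x85
    # U+20AC (€) -> stream byte 0x80

# ASCII is the lucky case: the byte value equals the code point, which is
# why plain ASCII text "happens" to come through unscathed.
assert "A".encode("cp1252")[0] == ord("A")
```

This is exactly why a ToUnicode table matters: without an explicit mapping
embedded in the PDF, nothing records that stream byte 0x85 was ever meant to be
U+2026.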

I personally think that implementing the ability for Koha to produce PDFs
directly may not be as difficult as it initially appears. Having said that, I
may find myself taking Joe's position before it's all said and done. :-)

Kind Regards,
Chris



>
> --Joe
>
> 2009/9/28 Nathan Gray <kolib...@graystudios.org>
>
>>  On Mon, Sep 28, 2009 at 09:21:39PM -0400, Chris Nighswonger wrote:
>> > The UTF to PDF conversion issue appears to be primarily caused by the
>> > fact that the PDF stream uses glyphIDs rather than unicode to display
>> > strings. Thus there is not a direct, one-to-one unicode-gliphID
>> > relationship. The reason that *some* unicode chars come across ok is
>> > more ascribable to chance than to design. This happens when the
>> > unicode *happens* to match the font gliphID. What really should be
>> > happening is that there should be a "ToUnicode" table built and
>> > embedded in the PDF file so that the relationship from unicode to
>> > gliphID may be properly defined.
>>
>> [snip]
>>
>> > Any thoughts, information, suggestions, etc. is most gratefully
>> appreciated.
>>
>> The cairographics project has done a lot of work on PDFs and text
>> to glyph translation, if I remember correctly.
>>
>>  http://cairographics.org
>>
>> A google search with these terms is a good start:
>>
>>  cairo graphics pdf text to glyph
>>
>> It looks like they rely on pango libraries (something called
>> pangocairo in particular).
>>
>> -kolibrie
>>
>> http://lists.koha.org/mailman/listinfo/koha-devel
>>
>
>
_______________________________________________
Koha-devel mailing list
Koha-devel@lists.koha.org
http://lists.koha.org/mailman/listinfo/koha-devel
