[Koha-devel] Diacriticals, Unicode, and PDFs
Does someone have a few bibs they can shoot my way which contain lots of diacriticals and are Unicode-encoded? Maybe something in French or Spanish for starters. I'm working toward fixing the Unicode problems with labels as a back-burner project.

Kind Regards,
Chris
Re: [Koha-devel] Diacriticals, Unicode, and PDFs
On 2009/09/29, at 12:23 PM, Chris Nighswonger wrote:

> Does someone have a few bibs they can shoot my way which contain lots
> of diacriticals and are Unicode-encoded? Maybe something in French or
> Spanish for starters. I'm working toward fixing the Unicode problems
> with labels as a back-burner project.

Great stuff, Chris. I really appreciate someone tidying up my kludgey code :)

So, I'm curious: is there a newer/better way to get around the less-than-perfect character-conversion issues with UTF-8 to PDF that were discussed on the lists in the last year or so?

Mason
Re: [Koha-devel] Diacriticals, Unicode, and PDFs
Hi Mason,

On Mon, Sep 28, 2009 at 7:40 PM, Mason James <mason.loves.su...@gmail.com> wrote:

> So, I'm curious: is there a newer/better way to get around the
> less-than-perfect character-conversion issues with UTF-8 to PDF that
> were discussed on the lists in the last year or so?

The UTF-8 to PDF conversion issue appears to be caused primarily by the fact that the PDF content stream uses glyph IDs rather than Unicode code points to display strings, so there is no direct, one-to-one Unicode-to-glyph-ID relationship. The reason that *some* Unicode characters come across OK is more ascribable to chance than to design: it happens when the code point *happens* to match the font's glyph ID. What really should happen is that a ToUnicode table is built and embedded in the PDF file, so that the relationship from Unicode to glyph ID is properly defined.

Logically, the next question is: how is this to be accomplished? The answer is: I have no concrete idea atm.

The first issue, I *think*, is that the standard 14 fonts do not extend far enough into the Unicode character set to be usable, afaict. So we will need to use fonts which do (e.g. GNU FreeFont, http://www.gnu.org/software/freefont/).

The second issue is to understand how ISO 32000-1 defines building a ToUnicode CMap (sect. 9.10.3) and grind out some code to construct these (probably more modifications to PDF::Reuse; I have made a number already, which the maintainer has agreed to include in the next release toward the end of October).

It may be as simple as embedding Unicode TTFs in the PDF file. If that is the case, the code for that is already in place in both PDF::Reuse and PDF::API2. I'm not convinced the solution is anywhere near that simple, or it would have been done by now.

But this is all somewhat subject to sudden and dramatic change, as I'm still very much on the PDF learning curve and could be way off target. I have had some correspondence with an individual who is a platform architect at Adobe and who has kindly offered to help clarify any questions regarding Unicode and PDF.

Any thoughts, information, suggestions, etc. are most gratefully appreciated.

Kind Regards,
Chris
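P.S. To make the "grind out some code" route a bit more concrete: the ToUnicode entry is just a small CMap stream attached to the font dictionary. Here is a sketch of the shape sect. 9.10.3 describes, built as a Perl string ready to be written out; the two bfchar mappings are made-up examples (single-byte font codes 0x41 and 0xE9 mapped to U+0041 and U+00E9), and a real table would have to be generated from the font's own cmap:

    # Minimal ToUnicode CMap stream, per ISO 32000-1 sect. 9.10.3.
    # The bfchar pairs below are illustrative only.
    my $to_unicode = <<'END_CMAP';
    /CIDInit /ProcSet findresource begin
    12 dict begin
    begincmap
    /CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
    /CMapName /Adobe-Identity-UCS def
    /CMapType 2 def
    1 begincodespacerange
    <00> <FF>
    endcodespacerange
    2 beginbfchar
    <41> <0041>
    <E9> <00E9>
    endbfchar
    endcmap
    CMapName currentdict /CMap defineresource pop
    end
    end
    END_CMAP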
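P.P.S. And the "embed a Unicode TTF" route, as a minimal, untested sketch with PDF::API2. The font path and filenames are placeholders, and whether PDF::API2 also writes a proper ToUnicode CMap for the embedded font is exactly the open question above:

    use strict;
    use warnings;
    use utf8;    # the string literal below contains diacriticals
    use PDF::API2;

    my $pdf  = PDF::API2->new();
    my $page = $pdf->page();

    # Embed a Unicode-capable TrueType font (any GNU FreeFont face should do).
    my $font = $pdf->ttfont('/usr/share/fonts/truetype/freefont/FreeSerif.ttf');

    my $text = $page->text();
    $text->font($font, 12);
    $text->translate(72, 720);
    $text->text('Bibliothèque générale: çà et là, mañana');

    $pdf->saveas('label-test.pdf');

If the glyph mapping comes out right, text search and copy-and-paste in the result should work; if it renders fine but extracts garbage, that is the missing ToUnicode table in a nutshell.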
Re: [Koha-devel] Diacriticals, Unicode, and PDFs
The problem is not really with Koha; it is with the PDF format. I worked on this a while back, and concluded it will not be possible to solve cleanly without serious trade-offs:

- controlling more aspects of the process, like requiring specific fonts on the user's system; or
- a dramatic increase in file size (orders of magnitude larger), from embedding the font in the PDF or producing effectively page-sized images; or
- non-free PDF components, or not supporting common versions of Acrobat Reader; or
- custom character-set conversion into ASCII (as much as possible), i.e. data loss.

The CPAN modules had really quite poor APIs for dealing with Unicode data, and any of the available methods would require a heavy overhaul of the code and of the approach to labels in general. In my opinion, development time might be better spent on piping the data into an external, known-good, Unicode-capable print tool, or something like OpenOffice. Generating PDFs out of (FOSS) Perl just didn't seem to be a viable answer.

I would be interested to see any counter-examples of FOSS Perl producing compact, cross-platform PDFs containing UTF-8 data like Chinese or Lithuanian... that don't require specific fonts.

--Joe

2009/9/28 Nathan Gray <kolib...@graystudios.org>:

> On Mon, Sep 28, 2009 at 09:21:39PM -0400, Chris Nighswonger wrote:
> > The UTF-8 to PDF conversion issue appears to be caused primarily by
> > the fact that the PDF content stream uses glyph IDs rather than
> > Unicode code points to display strings, so there is no direct,
> > one-to-one Unicode-to-glyph-ID relationship. The reason that *some*
> > Unicode characters come across OK is more ascribable to chance than
> > to design: it happens when the code point *happens* to match the
> > font's glyph ID. What really should happen is that a ToUnicode table
> > is built and embedded in the PDF file, so that the relationship from
> > Unicode to glyph ID is properly defined.
> [snip]
> > Any thoughts, information, suggestions, etc. are most gratefully
> > appreciated.
>
> The cairographics project has done a lot of work on PDFs and
> text-to-glyph translation, if I remember correctly.
>
> http://cairographics.org
>
> A Google search with these terms is a good start:
> cairo graphics pdf text to glyph
>
> It looks like they rely on the Pango libraries (something called
> pangocairo in particular).
>
> -kolibrie
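P.S. For anyone who wants to try the cairo/pango route kolibrie mentions from Perl: both libraries have CPAN bindings (Cairo and Pango). A minimal, untested sketch follows; the font string 'Sans 12' and the filenames are placeholders. Pango does the text-to-glyph translation, and Cairo writes the PDF surface, subsetting and embedding the fonts it uses, which should keep file size reasonable:

    use strict;
    use warnings;
    use utf8;    # literal UTF-8 text below
    use Cairo;
    use Pango;

    # A4 page, in PDF points (1/72 inch).
    my $surface = Cairo::PdfSurface->create('labels.pdf', 595, 842);
    my $cr      = Cairo::Context->create($surface);

    # Pango handles shaping and text-to-glyph translation.
    my $layout = Pango::Cairo::create_layout($cr);
    $layout->set_font_description(Pango::FontDescription->from_string('Sans 12'));
    $layout->set_text('Ačiū, bibliothèque, mañana');

    $cr->move_to(72, 72);
    Pango::Cairo::show_layout($cr, $layout);
    $cr->show_page;

This still requires suitable fonts on the generating machine, so it only partially answers Joe's "no specific fonts" condition, but the reader side needs nothing special.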