[Koha-devel] Diacriticals, Unicode, and PDF's

2009-09-28 Thread Chris Nighswonger
Does someone have a few bibs they can shoot my way which contain lots
of diacriticals and are unicode encoding? Maybe something in French or
Spanish for starters. I'm working toward fixing the unicode problems
with labels as a back-burner project.

Kind Regards,
Chris
___
Koha-devel mailing list
Koha-devel@lists.koha.org
http://lists.koha.org/mailman/listinfo/koha-devel


Re: [Koha-devel] Diacriticals, Unicode, and PDF's

2009-09-28 Thread Mason James

On 2009/09/29, at 12:23 PM, Chris Nighswonger wrote:

 Does someone have a few bibs they can shoot my way which contain lots
 of diacriticals and are unicode encoding? Maybe something in French or
 Spanish for starters. I'm working toward fixing the unicode problems
 with labels as a back-burner project.

 Kind Regards,
 Chris


great stuff Chris, i really appreciate someone tidying up my kludgey  
code :)


so - i'm curious... is there a newer/better way to get around the
less-than-perfect character-conversion issues with UTF to PDF that
were discussed on the lists in the last year or so (approx)?

Mason




Re: [Koha-devel] Diacriticals, Unicode, and PDF's

2009-09-28 Thread Chris Nighswonger
Hi Mason,

On Mon, Sep 28, 2009 at 7:40 PM, Mason James
mason.loves.su...@gmail.com wrote:

 so - i'm curious... is there a newer/better way to get around the
 less-than-perfect character-conversion issues with UTF to PDF, that were
 discussed on the lists in the last year or so (approx)

The UTF to PDF conversion issue appears to be primarily caused by the
fact that the PDF stream uses glyphIDs rather than Unicode to display
strings. Thus there is no direct, one-to-one Unicode-to-glyphID
relationship. The reason that *some* Unicode chars come across ok is
more ascribable to chance than to design. This happens when the
Unicode value *happens* to match the font glyphID. What really should be
happening is that a ToUnicode table should be built and
embedded in the PDF file so that the relationship from Unicode to
glyphID may be properly defined.
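
As a rough illustration of what such a table looks like, here is a minimal
Python sketch that emits a ToUnicode CMap stream body. The function name
build_tounicode_cmap, the sample glyph IDs, and the choice of 2-byte
character codes are all assumptions made for the example; none of this comes
from PDF::Reuse or PDF::API2, and real glyph IDs would have to be read from
the embedded font.

```python
def build_tounicode_cmap(glyph_to_text):
    """Return the text of a ToUnicode CMap stream.

    glyph_to_text maps a 16-bit character code (here assumed to equal the
    glyph ID) to the Unicode string that code should be read back as.
    """
    entries = sorted(glyph_to_text.items())
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<0000> <FFFF>",
        "endcodespacerange",
    ]
    # A single bfchar section may hold at most 100 entries, so chunk them.
    for start in range(0, len(entries), 100):
        chunk = entries[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, text in chunk:
            # Destination strings are written as UTF-16BE in hex.
            dst = text.encode("utf-16-be").hex().upper()
            lines.append(f"<{code:04X}> <{dst}>")
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical glyph IDs for two accented letters.
    print(build_tounicode_cmap({3: "\u00e9", 4: "\u00f1"}))
```

The resulting text would still have to be stored as a stream object and
referenced from the font dictionary's /ToUnicode entry, which this sketch
does not attempt.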

Logically, the next question is: How is this to be accomplished?

The answer is: I have no concrete idea atm.

I *think* that the first issue at hand is that the standard 14 fonts
do not extend far enough into the Unicode char set to be usable
afaict. So we will need to use fonts which do (e.g. GNU FreeFont,
http://www.gnu.org/software/freefont/).
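
One quick way to see how far a label string reaches beyond a basic 8-bit
Latin encoding is sketched below in Python. It approximates PDF's
WinAnsiEncoding with Python's cp1252 codec, which is an assumption (the two
are close but not identical), and the function name chars_outside_winansi is
made up for the example.

```python
def chars_outside_winansi(text):
    """List the distinct characters in text that cp1252 cannot encode.

    cp1252 is used here only as a stand-in for WinAnsiEncoding.
    """
    missing = []
    for ch in text:
        try:
            ch.encode("cp1252")
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


if __name__ == "__main__":
    for sample in ("Fran\u00e7ais", "espa\u00f1ol", "\u017demait\u0117", "\u4e2d\u6587"):
        print(sample, chars_outside_winansi(sample))
```

Under that approximation the French and Spanish samples come back with
nothing missing, while the Lithuanian and Chinese ones do not, so bibs in
several scripts would be useful test data.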

The second issue is to understand how ISO 32000-1 defines building a
ToUnicode CMap (sect 9.10.3) and grind out some code to construct
these (probably more modifications to PDF::Reuse: I have made a number
already which the maintainer has agreed to include in the next release
toward the end of October). It may be as simple as embedding Unicode
TTFs in the PDF file. If that is the case, the code for that is
already in place in both PDF::Reuse and PDF::API2. I'm not convinced
that the solution is anywhere near that simple or it would have been
done by now.

But this is all somewhat subject to sudden and dramatic change as I'm
still very much on the PDF learning curve and could be way
off target.

I have had some correspondence with an individual who is a platform
architect at Adobe and who has kindly offered to help clarify any
questions regarding unicode and PDF.

Any thoughts, information, suggestions, etc. are most gratefully appreciated.

Kind Regards,
Chris


Re: [Koha-devel] Diacriticals, Unicode, and PDF's

2009-09-28 Thread Joe Atzberger
The problem is not really with Koha; it is with the PDF format.  I worked on
this a while back, and concluded that it cannot be solved cleanly without
accepting one of these serious trade-offs:

   - controlling more aspects of the process, like requiring specific fonts
   on the user's system, or
   - a dramatic increase in file size (orders of magnitude larger) from
   embedding the font in the PDF or producing effectively page-sized images, or
   - non-free PDF components, or not supporting common versions of Acrobat
   Reader, or
   - custom character set conversion into ASCII (as much as possible), i.e.
   data loss.
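
The last trade-off can be shown in a few lines of Python. This is only an
illustration of the data loss involved, not code from Koha; the function
name to_ascii_lossy is made up for the example.

```python
import unicodedata


def to_ascii_lossy(text):
    """Reduce text to ASCII, discarding anything that has no ASCII base.

    NFKD splits an accented letter into a base letter plus combining marks;
    encoding with errors="ignore" then silently drops every non-ASCII
    code point, including the marks.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    for sample in ("espa\u00f1ol", "\u017demait\u0117", "\u4e2d\u6587"):
        print(repr(sample), "->", repr(to_ascii_lossy(sample)))
```

Accented Latin letters lose their diacritics, and text in scripts such as
Chinese disappears entirely.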

The CPAN modules had really quite poor APIs for dealing with Unicode data.
Any of the available methods would require heavy overhaul of the code and
the approach to labels in general.

In my opinion, development time might be better spent on piping the data
into an external, known-good, Unicode-capable print tool or something like
OpenOffice.  Generating PDFs out of (FOSS) Perl just didn't seem to be a
viable answer.

I would be interested to see any counter-examples with FOSS perl producing
compact, cross-platform PDFs with some UTF-8 data like Chinese, or
Lithuanian... that don't require specific fonts.

--Joe

2009/9/28 Nathan Gray kolib...@graystudios.org

 On Mon, Sep 28, 2009 at 09:21:39PM -0400, Chris Nighswonger wrote:
  The UTF to PDF conversion issue appears to be primarily caused by the
  fact that the PDF stream uses glyphIDs rather than Unicode to display
  strings. Thus there is no direct, one-to-one Unicode-to-glyphID
  relationship. The reason that *some* Unicode chars come across ok is
  more ascribable to chance than to design. This happens when the
  Unicode value *happens* to match the font glyphID. What really should be
  happening is that a ToUnicode table should be built and
  embedded in the PDF file so that the relationship from Unicode to
  glyphID may be properly defined.

 [snip]

  Any thoughts, information, suggestions, etc. are most gratefully
  appreciated.

 The cairographics project has done a lot of work on PDFs and text
 to glyph translation, if I remember correctly.

  http://cairographics.org

 A google search with these terms is a good start:

  cairo graphics pdf text to glyph

 It looks like they rely on pango libraries (something called
 pangocairo in particular).

 -kolibrie

