On 9 May 2011, at 11:22, Tim Brody wrote:

> On Sat, 2011-05-07 at 19:02 +0100, Jonathan Kew wrote:
>> On 7 May 2011, at 17:43, Albert Astals Cid wrote:
>> 
>>> On Friday, April 01, 2011, Albert Astals Cid wrote:
>>>> On Friday, 1 April 2011, Tim Brody wrote:
>>>>> On Thu, 31 Mar 2011 23:28:02 +0100, Albert Astals Cid <[email protected]>
>>>>> 
>>>>> wrote:
>>>>>> On Wednesday, 30 March 2011, you wrote:
>>>>>>> On Tue, 2011-03-29 at 22:45 +0100, Albert Astals Cid wrote:
>>>>>>>>>> I still get
>>>>>>>>>> 
>>>>>>>>>> -R. L¨wen and B. Polster
>>>>>>>>>> -o
>>>>>>>>>> +R. Lowen and B. Polster
>>>>>>>>>> 
>>>>>>>>>> Maybe you sent an old version of the patch? Can anyone confirm if
>>>>>> 
>>>>>> My bad: somehow vi/diff/less show me an "o", but if I open it in
>>>>>> kate I see an ö.
>>>>> 
>>>>> That will be because it's stored as separate characters (base letter
>>>>> + combining character). You could normalise with unicodeNormalizeNFKC,
>>>>> but I thought it probably better to leave the text - as far as
>>>>> possible - unchanged from the PDF source.
>>>> 
>>>> Hmmmmmm, since we are already changing the "real" representation of the
>>>> text (i.e. transforming it from broken to not broken), I think I prefer
>>>> one that is easy to use (i.e. shows ö in most of the tools). What do
>>>> others think?
>>> 
>>> Since the others are not there, please do what I want and output a real ö
>> 
>> If you're going to apply a Unicode normalization process, please use
>> NFC rather than NFKC. This will deal with creating precomposed
>> letter+accent combinations, but avoids introducing "compatibility"
>> changes that may lose significant distinctions in the text.
> 
> For reference:
> NFC = pre-composed
> NFKC = pre-composed plus simplified ligatures ('fi' => 'f'+'i')

NFKC will do much more than that; for example, mapping super- and subscripted 
letters to their "unstyled" counterparts. Even the ™ symbol is mapped to "TM". 
I don't think this would be desirable here.
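To make the difference concrete, here is a quick sketch in Python (poppler itself is C++, of course; this just uses Python's standard unicodedata module, which draws on the same Unicode tables) showing compatibility mappings that NFKC applies and NFC does not:

```python
import unicodedata

# Compatibility mappings applied by NFKC but not by NFC:
trademark = unicodedata.normalize("NFKC", "\u2122")   # TRADE MARK SIGN -> "TM"
ligature  = unicodedata.normalize("NFKC", "\ufb01")   # fi ligature    -> "fi"
squared   = unicodedata.normalize("NFKC", "x\u00b2")  # superscript 2  -> "x2"

# NFC leaves the same characters untouched
nfc_tm = unicodedata.normalize("NFC", "\u2122")

print(trademark, ligature, squared, nfc_tm)
```

Each of those NFKC mappings loses a distinction that was present in the source text, which is why NFC is the safer choice here.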

(See http://minaret.info/test/normalize.msp to experiment with normalization 
forms.)

> I agree, but there isn't an NFC implementation in poppler. It seems a
> waste of time to write one from scratch in Poppler; or is there really no
> Unicode library that provides normalisations?

The obvious example is ICU, but I doubt poppler wants to pull in a dependency 
on that. Though if it's not for core poppler but just a particular 
(poppler-based) tool, perhaps it's not such a bad idea.

glib also supports it, but may not be readily available everywhere that people 
want to use poppler.

A much smaller lib that includes an NFC function is TECkit 
<http://scripts.sil.org/teckit> (disclaimer: that was a project of mine), 
though it is not actively maintained these days, and could do with an update 
for Unicode 6.0. But the code is there, and updating the Unicode tables would 
be simple.

If poppler already has support for NFKC, I would expect it to be easy to 
support NFC as well - essentially, it just means using a subset of the 
decomposition tables. But I haven't actually looked at the code to see how this 
would work in practice.
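To illustrate the "subset" point (without having looked at poppler's code either): both forms decompose and then canonically recompose; the only difference is whether the compatibility decompositions are applied first. A Python sketch:

```python
import unicodedata

s = "\ufb01\u00e9"  # fi ligature + precomposed e-acute

# NFC = canonical decomposition + canonical composition:
#       the ligature is untouched
nfc = unicodedata.normalize("NFC", s)

# NFKC = compatibility decomposition + canonical composition:
#        the ligature is expanded to "fi", the e-acute is recomposed
nfkc = unicodedata.normalize("NFKC", s)

print(nfc, nfkc)
```

So an existing NFKC implementation should, in principle, yield NFC by restricting the decomposition table to the canonical mappings.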

JK

_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
