On Mon, 9 May 2011 19:52:38 +0100, Albert Astals Cid <[email protected]> wrote:
> On Monday, May 09, 2011, Tim Brody wrote:
>> On Sat, 2011-05-07 at 19:02 +0100, Jonathan Kew wrote:
>> > On 7 May 2011, at 17:43, Albert Astals Cid wrote:
>> > > On Friday, April 01, 2011, Albert Astals Cid wrote:
>> > >> On Friday, 1 April 2011, Tim Brody wrote:
>> > >>> On Thu, 31 Mar 2011 23:28:02 +0100, Albert Astals Cid
>> > >>> <[email protected]> wrote:
>> > >>>> On Wednesday, 30 March 2011, you wrote:
>> > >>>>> On Tue, 2011-03-29 at 22:45 +0100, Albert Astals Cid wrote:
>> > >>>>>>>> I still get
>> > >>>>>>>>
>> > >>>>>>>> -R. L¨wen and B. Polster
>> > >>>>>>>> -o
>> > >>>>>>>> +R. Lowen and B. Polster
>> > >>>>>>>>
>> > >>>>>>>> Maybe you sent an old version of the patch? Can anyone confirm
>> > >>>>>>>> if
>> > >>>>
>> > >>>> My bad, somehow vi/diff/less are showing me o, but if I open it in
>> > >>>> kate I see an ö
>> > >>>
>> > >>> That will be because it's separate characters (X + combining char).
>> > >>> You could normalise with unicodeNormalizeNFKC, but I thought it
>> > >>> probably better to leave the text - as far as possible - unchanged
>> > >>> from the PDF source.
>> > >>
>> > >> Hmmmmmm, since we are already changing the "real" representation of
>> > >> the text (i.e. transforming it from broken to not broken), I think I
>> > >> prefer one that is easy to use (i.e. shows ö in most of the tools).
>> > >> What do others think?
>> > >
>> > > Since the others are not there, please do what I want and output a
>> > > real ö
>> >
>> > If you're going to apply a Unicode normalization process, please use
>> > NFC rather than NFKC. This will deal with creating precomposed
>> > letter+accent combinations, but avoids introducing "compatibility"
>> > changes that may lose significant distinctions in the text.
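[Editor's note: the "separate characters" Tim describes below are a base letter followed by a combining mark, which is why vi/diff/less show a bare "o" while kate shows "ö". A minimal sketch of what NFC composition does to such text, using Python's `unicodedata` module purely as a stand-in for whatever normalization library poppler might adopt - this is not poppler's own API:]

```python
import unicodedata

# Decomposed form, as extracted from the PDF:
# 'o' followed by U+0308 COMBINING DIAERESIS (6 code points in total).
decomposed = "Lo\u0308wen"

# NFC composes base letter + combining mark into the single
# precomposed code point U+00F6 ('ö'), giving 5 code points.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 6 5
print(composed)                        # Löwen
```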
>>
>> For reference:
>> NFC = pre-composed
>> NFKC = pre-composed plus simplified ligatures ('fi' => 'f'+'i')
>>
>> I agree, but there isn't an NFC in poppler. It seems a waste of time to
>> be writing one from scratch in Poppler - or is there really no Unicode
>> library that provides normalisations?
>
> Couldn't you have said that (we have no code to compose stuff) when I
> asked the list if we wanted composed or not?
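[Editor's note: the distinction defined above - and the reason Jonathan asked for NFC rather than NFKC - can be sketched with Python's `unicodedata`, again only as a stand-in library. NFC composes canonically equivalent sequences, while NFKC additionally applies compatibility mappings such as splitting the 'fi' ligature, a change that loses the original distinction:]

```python
import unicodedata

text = "\ufb01le"  # starts with U+FB01, the 'fi' ligature (3 code points)

# NFC leaves the ligature alone: it has no canonical decomposition.
nfc = unicodedata.normalize("NFC", text)
print(nfc, len(nfc))    # still the ligature, 3 code points

# NFKC applies the compatibility mapping, replacing the ligature
# with plain 'f' + 'i' and discarding the original form.
nfkc = unicodedata.normalize("NFKC", text)
print(nfkc, len(nfkc))  # file 4
```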
poppler has an NFKC implementation which is used by the internal
word-search. I haven't found any indication of which version of the
Unicode tables it uses.

> Sincerely I am quite hesitant to apply your patch since it "breaks"
> pdftotext usage in the console (since it seems most of the apps in the
> console are not able to understand the non-composed form)

Your initial response to fixing up LaTeX-generated PDFs was "fix LaTeX",
but now you're saying we should make poppler work around broken shell
tools? :-)

Anyway, my patch is only a fix-up of overprinting characters that would
otherwise get mangled by pdftotext. It just makes it more apparent that
your tool-chain is broken, because it's producing more non-ASCII7
code-points.

I agree that pdftotext should by default output NFC, but you need to
decide whether to implement an NFC against the out-of-date poppler tables
or link to ICU.

-- 
All the best,
Tim.

_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
