On Fri, 2011-03-25 at 20:43 +0000, Albert Astals Cid wrote:
> On Friday, 25 March 2011, you wrote:
> > On Fri, 25 Mar 2011 19:02:46 +0000, Albert Astals Cid <[email protected]>
> >
> > NB I just tried extracting from a Word-generated PDF and TextOutputDev
> > didn't see the line with the diacritic at all.
>
> And are you sure it's not a Word fault?

(What tool do you use to de-compress/analyse PDFs?)

Here's the PDF file generated with Word 2010:

http://users.ecs.soton.ac.uk/tdb2/ms_word_accents.pdf

The bad text is this (it only contains two diacritics, but Word has
chewed the whole paragraph):

[<005A03580003003E>-4<0182>5<01C1011E>-3<0176>3<0003>9<01020176>4<011A>3<0003001
1035800030057>4<017D>-5<016F0190019A>10<011E018C>] TJ

There's a CMap included:

/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
17 beginbfchar
<0003> <0020>
<0011> <0020>
... repeated for all chars above mapping to 0020
<0358> <0020>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop

Do I understand correctly that Word is encoding all paragraphs
containing diacritics using a custom font table with a Unicode CMap
that maps every character to space (which is exactly the behaviour
shown by Chrome copy-n-paste)?

All the best,
Tim.
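
PS: for anyone who wants to check the reading above by hand, here is a minimal Python sketch of the lookup. It is not how TextOutputDev does it. The helper names (codes_in_tj, to_unicode) are made up for illustration, and since only three bfchar entries are quoted above, the table is filled in on the assumption that the "... repeated for all chars above mapping to 0020" note holds literally. Codes are taken to be two bytes wide, as the <0000> <FFFF> codespace range suggests.

```python
import re

# TJ operand copied from the message above; the mail wrap inside one
# hex string has been joined.
TJ = ("[<005A03580003003E>-4<0182>5<01C1011E>-3<0176>3<0003>9<01020176>4"
      "<011A>3<00030011035800030057>4<017D>-5<016F0190019A>10<011E018C>]")


def codes_in_tj(tj):
    """Return the 2-byte character codes in a TJ array, skipping kerning numbers."""
    codes = []
    for hexstr in re.findall(r"<([0-9A-Fa-f\s]+)>", tj):
        digits = re.sub(r"\s+", "", hexstr)  # whitespace inside <...> is not significant
        codes += [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
    return codes


def to_unicode(codes, bfchar):
    """Map each code through the bfchar table; unmapped codes become U+FFFD."""
    return "".join(chr(bfchar.get(code, 0xFFFD)) for code in codes)


codes = codes_in_tj(TJ)

# ASSUMPTION: every code used in the TJ array maps to U+0020, per the
# "repeated for all chars above" note; only 0003, 0011 and 0358 are quoted.
bfchar = {code: 0x0020 for code in set(codes)}

text = to_unicode(codes, bfchar)
print(len(set(codes)), "distinct codes;", len(codes), "codes in total")
print(repr(text))
```

The TJ array uses 17 distinct codes, which agrees with the "17 beginbfchar" count, and under the assumption above the extracted string is nothing but spaces.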
