Ross, I'm not sure I understand this mail at all; you are talking about the 0.6 branch, which is OLD.
Is it related to http://bugs.freedesktop.org/show_bug.cgi?id=20013 ?

Albert

On Sunday, 15 February 2009, Ross Moore wrote:
> Hi Jonathan,
>
> I think there is a serious problem in Poppler.
> However, it's possible that maybe there is something
> else wrong on my Mac.
> Before reporting it as a bug, would you please confirm
> (under Mac OS, or Linux, whatever you have)
> that what I say below isn't just local to my system.
>
> Cheers & thanks,
>
> Ross
>
>
> Hi Albert,
>
> This revisits a thread from December 2007, where you
> report adding a patch to support /ActualText .
> See also:
> [Poppler-bugs] [Bug 13573] Poppler does not support ActualText
>
>
> I'm now creating PDFs with /ActualText strings for CJK ideographs.
> These strings are given in big-endian UTF-16 format.
> Using pdftotext to extract the text, what I find is that:
>
> a) some, but not all, UTF-16 byte-pairs produce an extractable
> character.
>
> b) Whenever the *first* byte of the pair is in the upper range
> 128--255 then the whole character is omitted.
>
> For example, with the PDF string: (˛ˇt»»tt»)
> the text extracted using Adobe Reader is 瓈존瓈
> but Poppler produces 珈珈 , which exhibits two errors.
>
> Firstly, ...
>
> the portion '»t' has been extracted as '', the empty string,
> between the Chinese ideographs.
>
> In alternative representations, this is:
> (<FE><FF>t<C8><C8>tt<C8>) producing <E7><8F><88><E7><8F><88> ,
> where t<C8> representing 't»' extracts to
> <E7><8F><88> which is 珈 .
>
> Secondly, ...
>
> c) There is an error in the translation of UTF-16 characters
> into UTF-8. For example, the above t<C8> should actually
> convert in UTF-8 to <E7><93><88> which is 瓈 ,
> as done by Adobe and other software.
>
> The <E7><8F><88> is what correctly comes from s<C8> ;
> the top-order byte is being mistranslated by -1.
>
>
> Further comments.
>
> d) little-endian UTF-16 strings are not supported at all.
> There's no coding to swap the byte order within an extracted
> string.
>
> Instead the byte-order mark in (ˇ˛»tt»»t) isn't recognised,
> so Poppler extracts just the letters 'ttt'.
>
>
> e) octal codes can be used, contrary to a question that I raised
> in bug report 20013 .
> There my testing was with codes which produced 1st bytes
> within the upper range, so the difficulties were the same
> as in b) above.
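
For anyone who wants to check the byte values quoted above independently of any PDF reader, here is a minimal sketch in Python. It is not Poppler code: decode_text_string is a hypothetical helper written only for this illustration, the latin-1 fallback is a rough stand-in for PDFDocEncoding, and the little-endian branch is included only to mirror point d), not because readers are known to accept it.

```python
import codecs


def decode_text_string(raw: bytes) -> str:
    """Decode a string by its byte-order mark (hypothetical helper, not Poppler's)."""
    if raw.startswith(codecs.BOM_UTF16_BE):   # <FE><FF>
        return raw[2:].decode("utf-16-be")
    if raw.startswith(codecs.BOM_UTF16_LE):   # <FF><FE>, shown for comparison only
        return raw[2:].decode("utf-16-le")
    # Assumption: latin-1 is used here as a simple stand-in for PDFDocEncoding.
    return raw.decode("latin-1")


def hex_utf8(text: str) -> str:
    """Render the UTF-8 encoding of text in the <XX> notation used in the mail."""
    return "".join(f"<{b:02X}>" for b in text.encode("utf-8"))


# The big-endian example string: <FE><FF> t <C8> <C8> t t <C8>
big_endian = b"\xfe\xff" + b"t\xc8" + b"\xc8t" + b"t\xc8"
# The little-endian variant from point d): <FF><FE> <C8> t t <C8> <C8> t
little_endian = b"\xff\xfe" + b"\xc8t" + b"t\xc8" + b"\xc8t"

decoded = decode_text_string(big_endian)
print(decoded, hex_utf8(decoded))

# The pair s<C8> on its own, which is what yields <E7><8F><88>
print(hex_utf8(b"s\xc8".decode("utf-16-be")))
```

With a standard UTF-16 decoder, t<C8> is U+74C8 and encodes to <E7><93><88>, while <E7><8F><88> corresponds to U+73C8, i.e. the pair s<C8>. That matches the arithmetic in point c).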
