On 19 Dec 2007, at 12:06 pm, Adrian Johnson wrote:

> I've created a test file to test the patch
>
> http://annarchy.freedesktop.org/~ajohnson/test.pdf
>
> The numbers "1", "2", and "3", are mapped to the text "test", "text",
> and "the". The "Z" has the glyph name "g1" so it should be ignored when
> extracting text.
>
> I have found a bug in the code. With the test file I get
>
> $ pdftotext test.pdf -
> Error: Could not parse charref for nameToUnicode: g1
> This is = test of text extr=?tion using the glyph n=mes
>
> The output should be:
> This is a test of text extraction using the glyph names
>
> It looks like the glyph names "u00061" and "u0063" are not decoded
> correctly.

To be more specific, it looks as though the names are being interpreted as decimal rather than hexadecimal: character 61 decimal is "=" and 63 decimal is "?", which is exactly what replaces "a" and "c" in the output. Could it be that some implementations of sscanf require a 0x prefix to scan hex, and otherwise treat the value as decimal?

JK
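
A minimal sketch of the decimal-versus-hexadecimal hypothesis, in plain C. The helper names glyph_code_hex and glyph_code_dec are hypothetical and are not taken from the poppler source; they only compare how sscanf reads the digits after the leading "u" under the %x and %d conversions. As far as I know, standard C does not require a 0x prefix for %x (the prefix is optional), so if the hypothesis holds, the more likely cause is a decimal conversion being used rather than a quirk of a particular sscanf.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not poppler code): read the digits after the
 * leading 'u' as hexadecimal. Returns the code point, or -1 if the
 * name does not start with 'u' or has no parsable digits. */
long glyph_code_hex(const char *name)
{
    unsigned int code;

    assert(name != NULL);
    if (name[0] != 'u' || sscanf(name + 1, "%x", &code) != 1)
        return -1;
    return (long)code;
}

/* Hypothetical helper: the same digits read with %d, i.e. as decimal.
 * This is the suspected faulty behaviour, kept here for comparison. */
long glyph_code_dec(const char *name)
{
    int code;

    assert(name != NULL);
    if (name[0] != 'u' || sscanf(name + 1, "%d", &code) != 1)
        return -1;
    return (long)code;
}
```

With these helpers, glyph_code_hex gives 0x61 and 0x63 for "u00061" and "u0063", while glyph_code_dec gives 61 and 63 for the same names. A name such as "g1" is rejected by both, since it does not start with "u".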
