Hi!

#1. You can't build a global per-font table the way you envision it. Embedded fonts often include only the required symbols, meaning that embedded versions of the same font can and do differ from document to document, and consequently the character codes differ too.
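
To make #1 concrete, here is a small Python sketch. It is only an illustration: it assumes a PDF producer that numbers the glyphs of a subsetted font in order of first use, which is one possible behaviour and not something every producer does. The helper `subset_codes` and the sample strings are made up for this example and are not part of Poppler.

```python
def subset_codes(text):
    """Hypothetical subsetter: give each character the next free code,
    in order of first appearance (an assumption, see the note above)."""
    codes = {}
    for ch in text:
        if ch not in codes:
            codes[ch] = len(codes) + 1
    return codes

# Two documents embed different subsets of the "same" symbol font.
text_a = "α+β"
text_b = "β≤α"
codes_a = subset_codes(text_a)
codes_b = subset_codes(text_b)

# A code -> Unicode table built by inspecting document A only.
table_from_a = {code: ch for ch, code in codes_a.items()}

# Applying that table to the character codes found in document B.
decoded = "".join(table_from_a[codes_b[ch]] for ch in text_b)

print("real text of B:   ", text_b)
print("decoded with A's: ", decoded)
```

Under that assumption, a table keyed only by font name and character code decodes document B into the wrong symbols without any error, which is why such a table can only be trusted for the documents it was built from.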

#2. I worked on something similar a long time ago. What I did was to modify pdftohtml to print the characters of fonts without a Unicode mapping as raw codes, in the XML/HTML notation: &#<code>; (I can't remember right now what trick I used to differentiate the fonts.) Finally, I replaced the codes with real characters semi-manually.

> If there is a stopgap method by which I could add such info to
> Poppler source somewhere and then recompile (hard-coding the table),
> please let me know - I'm fine with that for short-term use though I
> think a runtime table would be much much more flexible and useful.

I will try to locate my sources. That would at least give you hints about where to plug in the tables. But due to #1, you shouldn't trust such automated conversions too much.

P.S. I have also seen cases where a single character was, for whatever reason, represented by *multiple* character codes. IOW, with some documents a character code -> Unicode translation isn't possible, as it would leave some garbage in the document.

On 6/11/13, Jeff Lerman <[email protected]> wrote:
> Hi,
>
> This is my first post to the list, and I apologize in advance for any
> naivete revealed by my question. However:
>
> BACKGROUND:
> I have a project for which my team is extracting text from a large
> number (~100K) of PDF files from scientific publications. These PDFs
> come from a wide variety of sources. They often use obscure-sounding
> fonts for symbols, and those fonts do not seem to include toUnicode data
> in the PDFs themselves. The mapping in these fonts is not obvious and
> needs to be determined on a case-by-case (often character-by-character
> when the font info is unavailable online) basis.
>
> I have been accumulating my own table of character mappings for those
> fonts, focusing on characters of most interest to our team (certain
> symbols). I would like to be able to apply that table during
> text-extraction by pdftotext, but I don't see any way to do that
> currently.
> Since complaints about obscure non-documented font/character
> mappings are common online, application of such a table seems like
> something that would be of potentially broad interest.
>
> REQUEST:
> Ideally, I'd like to be able to take a 3-column table (see below) that I
> have built and supply it to pdftotext at runtime. The table would be
> applied in cases where a given character from a given font appears in a
> PDF, no toUnicode table is supplied in the PDF, and the character does
> appear in the supplied table (characters missing from the table would
> continue to be extracted the way pdftotext does it today - i.e.,
> characters missing from the table should have no effect).
>
> The table would simply be a tab-delimited 3-column file with:
> 1. fontname, e.g. AdvP4C4E74 or AdvPi1 or YMath-Pack-Four, but NOT
>    things like NJIBIE+YMath-Pack-Four
> 2. font character (could supply an actual character, or a hexadecimal
>    codepoint)
> 3. desired Unicode mapping (again - could be an actual character or a
>    codepoint)
>
> Exact table format isn't a big deal, but the above info is all that
> should be needed.
>
> If there is *already* a way to do this in pdftotext, please let me
> know. If there is a stopgap method by which I could add such info to
> Poppler source somewhere and then recompile (hard-coding the table),
> please let me know - I'm fine with that for short-term use though I
> think a runtime table would be much much more flexible and useful.
>
> Thanks!
> --Jeff Lerman
>
> _______________________________________________
> poppler mailing list
> [email protected]
> http://lists.freedesktop.org/mailman/listinfo/poppler
>

--
Don't walk behind me, I may not lead.
Don't walk in front of me, I may not follow.
Just walk beside me and be my friend.
    -- Albert Camus (attributed to)
