[poppler] pdftotext feature request: user-specified toUnicode-like tables

Jeff Lerman Tue, 11 Jun 2013 09:52:07 -0700

Hi,

This is my first post to the list, and I apologize in advance for any naivete revealed by my question. However:


BACKGROUND:

I have a project for which my team is extracting text from a large number (~100K) of PDF files from scientific publications. These PDFs come from a wide variety of sources. They often use obscure-sounding fonts for symbols, and those fonts do not seem to include toUnicode data in the PDFs themselves. The mapping in these fonts is not obvious and needs to be determined on a case-by-case (often character-by-character when the font info is unavailable online) basis.

I have been accumulating my own table of character mappings for those fonts, focusing on characters of most interest to our team (certain symbols). I would like to be able to apply that table during text-extraction by pdftotext, but I don't see any way to do that currently. Since complaints about obscure non-documented font/character mappings are common online, application of such a table seems like something that would be of potentially broad interest.


REQUEST:

Ideally, I'd like to be able to take a 3-column table (see below) that I have built and supply it to pdftotext at runtime. The table would be applied in cases where a given character from a given font appears in a PDF, no toUnicode table is supplied in the PDF, and the character does appear in the supplied table (characters missing from the table would continue to be extracted the way pdftotext does it today - i.e., characters missing from the table should have no effect).


The table would simply be a tab-delimited 3-column file with:

1. fontname, e.g. AdvP4C4E74 or AdvPi1 or YMath-Pack-Four, but NOT things like NJIBIE+YMath-Pack-Four
2. font character (could supply an actual character, or a hexadecimal codepoint)
3. desired Unicode mapping (again - could be an actual character or a codepoint)

Exact table format isn't a big deal, but the above info is all that should be needed.
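
To make the intended semantics concrete, here is a minimal sketch in Python of how such a table might be parsed and applied. It is not Poppler code and does not correspond to any existing pdftotext option. The function names, the "0x" prefix for hexadecimal codepoints, and the sample rows in the usage note are all assumptions made for illustration. The one PDF-side fact it relies on is that subset fonts carry a tag of six uppercase letters plus "+" in front of the font name, which is why NJIBIE+YMath-Pack-Four should match a table row for YMath-Pack-Four.

```python
import re

# PDF subset fonts are named with a six-uppercase-letter tag plus "+",
# e.g. "NJIBIE+YMath-Pack-Four". The table keys omit that tag.
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def base_font_name(name):
    """Strip a subset tag so a subset font matches its table entry."""
    return SUBSET_TAG.sub("", name)


def parse_char(field):
    """Accept a literal character or a hex codepoint.

    Assumption: hex codepoints are written with a "0x" prefix.
    """
    if field.lower().startswith("0x"):
        return chr(int(field, 16))
    return field


def load_table(lines):
    """Build {(fontname, font character): Unicode text} from tab-delimited rows."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        font, src, dst = line.split("\t")
        table[(font, parse_char(src))] = parse_char(dst)
    return table


def map_char(table, font, char, has_to_unicode):
    """Apply the table only when the PDF supplies no toUnicode data.

    Characters missing from the table are returned unchanged, so the
    table has no effect on them.
    """
    if has_to_unicode:
        return char
    return table.get((base_font_name(font), char), char)
```

For example, with two made-up rows (these are not real mappings for those fonts), `load_table(["AdvPi1\t0x61\t0x3B1", "YMath-Pack-Four\tb\t0x2264"])` yields a table for which `map_char(table, "NJIBIE+YMath-Pack-Four", "b", False)` returns U+2264, while any character absent from the table, or any font that does carry toUnicode data, passes through untouched.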

If there is *already* a way to do this in pdftotext, please let me know. If there is a stopgap method by which I could add such info to Poppler source somewhere and then recompile (hard-coding the table), please let me know - I'm fine with that for short-term use though I think a runtime table would be much, much more flexible and useful.


Thanks!
--Jeff Lerman

