Thanks!
Regarding your #1: Yes, the embedded fonts I'm seeing often only include
a few required symbols. However, so far I have not seen any cases where
a particular character in a particular named font has a different
mapping from one PDF to the next. I am prepared to believe that such a
thing might happen (for a while I thought it happened often) but so far
it seems to be rare. The vast majority of the problems I am seeing
could be addressed by a user-constructed table like the one I am
proposing - and such a table would let a user quickly fix problems for
a whole set of PDFs that use some obscure font. Note also that I
specify that characters missing from my table should be handled by
whatever default path pdftotext already uses (i.e., they should have
no effect).
Regarding #2 (the pdftohtml solution): Currently, pdftohtml (version
0.22.4) does a poor job of indicating which character in a PDF is in
which font; the font seems to be reported only on a per-word basis.
Some pdftohtml cleanup is definitely needed there. In the meantime,
since what I really want is almost exactly what pdftotext already
provides, just with font-aware character remapping, I'd much prefer a
solution that allows pdftotext to "do the right thing" for these
fonts, since I already have the mapping info for the cases I care
most about.
Unfortunately, I am not really a C++ programmer, so minor code edits and
rebuilds are within my skillset, but significant enhancements/rewrites
are not.
If you have PDF examples where a single glyph is represented using
multiple character codes, that would be interesting to see - but would
not be a problem for a remapping algorithm (and I can imagine cases
where it would happen; in fact it essentially does happen already in
Unicode). Many-to-one is easy. One-to-many would obviously be
problematic - are you saying you've seen that too? I thought that would
be impossible, assuming a font-aware algorithm.
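The many-to-one vs. one-to-many distinction above can be made concrete with a small sketch (plain Python for illustration only - not poppler code; the font name, codes, and function names are all hypothetical):

```python
# A font-aware table keyed by (font name, character code).
# Two different codes mapping to the same glyph ("many-to-one") is
# unproblematic - each lookup is still deterministic:
REMAP = {
    ("AdvP4C4E74", 0x61): "\u03b1",  # code 0x61 -> GREEK SMALL LETTER ALPHA
    ("AdvP4C4E74", 0x41): "\u03b1",  # code 0x41 -> the same alpha; no conflict
}

def remap(font, code, default):
    """Table hit wins; a miss falls back to the existing default mapping."""
    return REMAP.get((font, code), default)

# Many-to-one: both codes resolve to the same character.
assert remap("AdvP4C4E74", 0x61, "?") == remap("AdvP4C4E74", 0x41, "?")

# One-to-many would require one (font, code) key to carry two different
# values - impossible in a deterministic lookup table; a second
# assignment simply overwrites the first:
REMAP[("AdvP4C4E74", 0x62)] = "\u03b2"
REMAP[("AdvP4C4E74", 0x62)] = "\u03b3"  # silently replaces beta
```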
Thanks,
--Jeff
On 6/11/2013 10:06 AM, Ihar `Philips` Filipau wrote:
Hi!
#1.
You can't build a global per-font table as you envision it. The
embedded fonts often include only the required symbols, meaning that
embedded versions of the same font can and do differ from document
to document - and consequently the character codes differ too.
#2.
I worked on something similar a long time ago. What I did was modify
pdftohtml to print the characters of fonts without a Unicode mapping
as raw codes, in XML/HTML notation: &#<code>; (I can't remember
right now what trick I used to differentiate the fonts.) Finally,
I semi-manually replaced the codes with the real characters.
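The post-processing step described above could look something like this sketch (my own assumption of the workflow, not the actual code; the table contents are invented):

```python
import re

# Hypothetical per-font table resolving raw codes that pdftohtml
# emitted as &#<code>; numeric references.
CODE_TO_CHAR = {
    120: "\u03b1",  # code 120 in this font is actually Greek alpha
    121: "\u03b2",  # code 121 is Greek beta
}

def resolve(html):
    """Replace each &#NNN; with its mapped character; leave unknown
    codes untouched so they stay visible for manual review."""
    def sub(m):
        code = int(m.group(1))
        return CODE_TO_CHAR.get(code, m.group(0))
    return re.sub(r"&#(\d+);", sub, html)
```

Leaving unresolved codes in place is what makes the process "semi-manual": each leftover `&#NNN;` flags a mapping still to be determined by hand.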
If there is a stopgap method by which I could add such info to
Poppler source somewhere and then recompile (hard-coding the table),
please let me know - I'm fine with that for short-term use though I
think a runtime table would be much much more flexible and useful.
I will try to locate my sources. That would at least give you hints
about where to plug in the tables.
But due to #1, you shouldn't trust such automated conversions too much.
P.S. I have also seen the effect where a single character was, for
whatever reason, represented with *multiple* character codes. In other
words, with some documents a character-code-to-Unicode translation
isn't possible, as it would leave some garbage in the document.
On 6/11/13, Jeff Lerman <[email protected]> wrote:
Hi,
This is my first post to the list, and I apologize in advance for any
naivete revealed by my question. However:
BACKGROUND:
I have a project for which my team is extracting text from a large
number (~100K) of PDF files from scientific publications. These PDFs
come from a wide variety of sources. They often use obscure-sounding
fonts for symbols, and those fonts do not seem to include toUnicode data
in the PDFs themselves. The mapping in these fonts is not obvious and
needs to be determined on a case-by-case (often character-by-character
when the font info is unavailable online) basis.
I have been accumulating my own table of character mappings for those
fonts, focusing on characters of most interest to our team (certain
symbols). I would like to be able to apply that table during
text-extraction by pdftotext, but I don't see any way to do that
currently. Since complaints about obscure, undocumented font/character
mappings are common online, support for applying such a table seems
like something of potentially broad interest.
REQUEST:
Ideally, I'd like to be able to take a 3-column table (see below) that I
have built and supply it to pdftotext at runtime. The table would be
applied in cases where a given character from a given font appears in a
PDF, no toUnicode table is supplied in the PDF, and the character does
appear in the supplied table (characters missing from the table would
continue to be extracted the way pdftotext does it today - i.e.,
characters missing from the table should have no effect).
The table would simply be a tab-delimited 3-column file with:
1. fontname, e.g. AdvP4C4E74 or AdvPi1 or YMath-Pack-Four, but NOT
things like NJIBIE+YMath-Pack-Four
2. font character (could supply an actual character, or a hexadecimal
codepoint)
3. desired Unicode mapping (again - could be an actual character or a
codepoint)
Exact table format isn't a big deal, but the above info is all that
should be needed.
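To make the proposed semantics concrete, here is a minimal sketch (illustrative Python, not poppler code; the function names and the sample table row are hypothetical) of parsing the 3-column tab-delimited file and applying it with the "missing entries have no effect" fallback rule described above:

```python
def parse_field(field):
    """Accept either a literal character or a hex codepoint like 0x03b1."""
    if field.startswith("0x"):
        return chr(int(field, 16))
    return field

def load_table(lines):
    """Build {(fontname, char): replacement} from tab-delimited rows."""
    table = {}
    for line in lines:
        font, src, dst = line.rstrip("\n").split("\t")
        table[(font, parse_field(src))] = parse_field(dst)
    return table

def map_char(table, font, ch, default):
    """A table hit wins; a miss returns whatever pdftotext would have
    produced anyway (the 'no effect' rule)."""
    return table.get((font, ch), default)

# One sample row: in font AdvPi1, code 0x61 should extract as U+03B1.
table = load_table(["AdvPi1\t0x61\t0x03b1"])
map_char(table, "AdvPi1", "a", "a")  # remapped to Greek alpha
map_char(table, "AdvPi1", "b", "b")  # not in table: unchanged
```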
If there is *already* a way to do this in pdftotext, please let me
know. If there is a stopgap method by which I could add such info to
Poppler source somewhere and then recompile (hard-coding the table),
please let me know - I'm fine with that for short-term use though I
think a runtime table would be much much more flexible and useful.
Thanks!
--Jeff Lerman
_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler