Re: [poppler] pdftotext feature request: user-specified toUnicode-like tables

Ihar `Philips` Filipau Tue, 11 Jun 2013 10:07:38 -0700

Hi!

#1.
You can't build a global per-font table the way you envision it.
Embedded fonts often include only the symbols a document requires, so
embedded versions of the same font can and do differ from document to
document, and consequently the character codes differ too.
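
To make the subsetting point concrete, here is a minimal Python sketch. It
relies on the PDF convention that a subset font name carries a tag of six
uppercase letters plus "+" (as in NJIBIE+YMath-Pack-Four from the quoted
request). The table_key helper and its document_id parameter are
hypothetical, shown only to illustrate keying a mapping per document and
per subset instead of per base font name.

```python
import re

# A subset tag is six uppercase letters followed by "+", e.g. "NJIBIE+".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def split_subset_tag(font_name):
    """Return (tag, base_name); tag is "" when the name has no subset tag."""
    match = SUBSET_TAG.match(font_name)
    if match is None:
        return "", font_name
    # Drop the trailing "+" from the tag and the whole prefix from the name.
    return match.group(0)[:-1], font_name[match.end():]


def table_key(document_id, font_name, code):
    """Hypothetical lookup key that keeps the document and the subset tag.

    Two subsets of the same base font may assign the same code to different
    glyphs, so the base name plus the code is not a safe key on its own.
    """
    tag, base = split_subset_tag(font_name)
    return (document_id, tag, base, code)
```

With a key like this, a mapping learned from one document is never applied
silently to a differently numbered subset in another one.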


#2.
I worked on something similar a long time ago. I modified pdftohtml
to print the characters of fonts without a Unicode mapping as raw
codes, in XML/HTML notation: &#<code>; (I can't remember right now
what trick I used to tell the fonts apart.) Afterwards I replaced the
codes with real characters semi-manually.
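
The replacement step could look roughly like the following Python sketch.
It is not the original code: it assumes the raw codes appear as decimal
references of the form &#<code>; and that all of them belong to a single
font, since the trick for telling fonts apart is not described. The
function name and the mapping argument are illustrative.

```python
import re

# Matches decimal numeric references such as "&#1;" or "&#172;".
NUMERIC_REF = re.compile(r"&#(\d+);")


def replace_raw_codes(text, mapping):
    """Replace &#<code>; references using a hand-built {code: character} map.

    Codes missing from the map are left untouched, so they stay visible
    and can be resolved by hand in a later pass.
    """
    def substitute(match):
        code = int(match.group(1))
        return mapping.get(code, match.group(0))

    return NUMERIC_REF.sub(substitute, text)
```

Leaving unknown codes in place keeps the process semi-manual: each pass
over the output shows which codes still need an entry.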


> If there is a stopgap method by which I could add such info to
> Poppler source somewhere and then recompile (hard-coding the table),
> please let me know - I'm fine with that for short-term use though I
> think a runtime table would be much much more flexible and useful.

I will try to locate my sources.
That would at least give you hints about where to plug in the tables.
But because of #1, you shouldn't put too much trust in such automated
conversions.
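
For experimenting outside Poppler, the three-column table from the quoted
request (font name, font character, desired Unicode mapping) could be
loaded with a Python sketch like the one below. Nothing here is a
pdftotext feature. The function names are made up, and the rule that a
one-character field is a literal while a longer field is a hexadecimal
codepoint is an assumption, since the request leaves the exact format open.

```python
def parse_codepoint(field):
    """Return an integer codepoint from a literal character or a hex string.

    Assumption: a single-character field is a literal; anything longer is
    hexadecimal, optionally prefixed with "U+" or "0x".
    """
    if len(field) == 1:
        return ord(field)
    if field.upper().startswith("U+"):
        field = field[2:]
    return int(field, 16)


def load_table(lines):
    """Build {(font_name, code): unicode_char} from tab-delimited lines."""
    table = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        font, source, target = fields
        table[(font, parse_codepoint(source))] = chr(parse_codepoint(target))
    return table


def lookup(table, font, code, fallback):
    """Return the mapped character, or the fallback when there is no entry."""
    return table.get((font, code), fallback)
```

The fallback argument mirrors the request that characters missing from
the table keep whatever the extractor produces today.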

P.S. I have also seen cases where a single character was, for whatever
reason, represented by *multiple* character codes. In other words, with
some documents a character code -> Unicode translation isn't possible,
as it would leave some garbage in the document.

On 6/11/13, Jeff Lerman <[email protected]> wrote:
> Hi,
>
> This is my first post to the list, and I apologize in advance for any
> naivete revealed by my question.  However:
>
> BACKGROUND:
> I have a project for which my team is extracting text from a large
> number (~100K) of PDF files from scientific publications.  These PDFs
> come from a wide variety of sources.  They often use obscure-sounding
> fonts for symbols, and those fonts do not seem to include toUnicode data
> in the PDFs themselves.  The mapping in these fonts is not obvious and
> needs to be determined on a case-by-case (often character-by-character
> when the font info is unavailable online) basis.
>
> I have been accumulating my own table of character mappings for those
> fonts, focusing on characters of most interest to our team (certain
> symbols).  I would like to be able to apply that table during
> text-extraction by pdftotext, but I don't see any way to do that
> currently.  Since complaints about obscure non-documented font/character
> mappings are common online, application of such a table seems like
> something that would be of potentially broad interest.
>
> REQUEST:
> Ideally, I'd like to be able to take a 3-column table (see below) that I
> have built and supply it to pdftotext at runtime.  The table would be
> applied in cases where a given character from a given font appears in a
> PDF, no toUnicode table is supplied in the PDF, and the character does
> appear in the supplied table (characters missing from the table would
> continue to be extracted the way pdftotext does it today - i.e.,
> characters missing from the table should have no effect).
>
> The table would simply be a tab-delimited 3-column file with:
> 1. fontname, e.g. AdvP4C4E74 or AdvPi1 or YMath-Pack-Four, but NOT
> things like NJIBIE+YMath-Pack-Four
> 2. font character (could supply an actual character, or a hexadecimal
> codepoint)
> 3. desired Unicode mapping (again - could be an actual character or a
> codepoint)
>
> Exact table format isn't a big deal, but the above info is all that
> should be needed.
>
> If there is *already* a way to do this in pdftotext, please let me
> know.  If there is a stopgap method by which I could add such info to
> Poppler source somewhere and then recompile (hard-coding the table),
> please let me know - I'm fine with that for short-term use though I
> think a runtime table would be much much more flexible and useful.
>
> Thanks!
> --Jeff Lerman
>
> _______________________________________________
> poppler mailing list
> [email protected]
> http://lists.freedesktop.org/mailman/listinfo/poppler
>


-- 
Don't walk behind me, I may not lead.
Don't walk in front of me, I may not follow.
Just walk beside me and be my friend.
    -- Albert Camus (attributed to)
