On 6/11/2013 3:34 PM, Ihar `Philips` Filipau wrote:
On 6/11/13, Jeff Lerman <[email protected]> wrote:
Yes, indicating words is an advantage - but failing to indicate which font a
given character in a word uses is a bug.
This is about the right time to tell you something: the focus of poppler is
the on-screen representation of the PDF, not helping to extract
information from PDFs. Otherwise, a year ago I would have flooded
the place with patches. :D
On the one hand, fair point - on the other hand, pdftotext is included in poppler, and as Albert has pointed out, things get better when they get worked on.
To paint a character, one doesn't need to know its Unicode value - the raw
code point is an index into the font's glyph data for the character. The
Unicode value of a character is only needed for copy-pasting. (Some PDF
software intentionally strips the Unicode mapping tables to make
copy-paste/text extraction unusable.)
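
To make the distinction concrete, here is a minimal sketch (hypothetical
types, not poppler's API) of the two mappings a font carries:

  #include <cstdint>
  #include <map>
  #include <optional>

  using CharCode = uint32_t;
  using GlyphId  = uint32_t;

  struct SimpleFont {
      std::map<CharCode, GlyphId>  glyphs;     // always needed: paints the page
      std::map<CharCode, char32_t> toUnicode;  // optional: only for extraction

      // Painting: the raw code point indexes straight into the glyph data.
      GlyphId glyphFor(CharCode c) const { return glyphs.at(c); }

      // Extraction: fails when the producer stripped the ToUnicode table,
      // even though rendering still works fine.
      std::optional<char32_t> unicodeFor(CharCode c) const {
          auto it = toUnicode.find(c);
          if (it == toUnicode.end()) return std::nullopt;
          return it->second;
      }
  };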

Otherwise, it is worth googling "pdf2htmlEX" and/or "pdftohtmlEX" - search
for the precise terms. There are several projects on the net (one of them
is definitely based on poppler) focused on extracting text/etc. from PDFs
into HTML with a high level of fidelity. That would probably be more
helpful to you than forcing poppler to do something it is not designed
to do.
I'm looking up those packages now, and have checked out pdf2htmlEX. However, HTML is an unnecessary intermediate for our purposes. Also, I don't see evidence (happy to be corrected here) that pdf2htmlEX contains a table of mappings for obscure fonts, or a way for users to specify their own mapping tables. Without that, it doesn't help me.
Now, when I use pdftohtml (I'll include the actual command below too), I
get a file that includes:
  .....
As you can see, the font "MathematicalPi-One" is not noted as being the
correct one for that numeral "1".  There is no way to find out the
actual fonts being used, on a per-character basis, for the text in the
PDF file.
That is what I meant by saying that pdftohtml erroneously merges some fonts.
But this is not per se a bug: conversion of PDF into HTML is at best an
approximate process, primarily optimized to display an average PDF in a
readable fashion.
I would say that "erroneously merges" is indeed a bug. (Unless the documentation specifies that behavior - in which case it might be a feature, though not a great one, I'd suggest.) If the mis-assignment of fonts also happens in pdftotext, and if it happens before conversion to Unicode, then my request will be blocked (even if I choose to try coding it myself), since the per-character font information is lost (discarded) by pdftotext. Does anyone have any information on whether that is the case?
... In fact, modulo the "Tagged PDF" feature, PDF is not designed to represent text per se. The most common PDF is just a container of vector graphics, and some of that graphics happens to draw text. Extraction of text literally works by intercepting a text-drawing operation and, instead of drawing the text, dumping it into a file/etc. @Leonard, please don't hit me. /me *cowers*. :D
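
And to illustrate "extraction is interception", a sketch modeled loosely
on poppler's OutputDev::drawChar callback - the signature below is
simplified and illustrative, not the exact poppler header:

  #include <cstdio>

  class DumpingTextDev /* would derive from poppler's OutputDev */ {
  public:
      // Instead of rasterizing the glyph, write out whatever Unicode the
      // font's mapping produced for this character code.
      void drawChar(double /*x*/, double /*y*/, unsigned /*code*/,
                    const unsigned *u, int uLen) {
          if (uLen == 0) {                      // no ToUnicode mapping at all
              std::printf("?");
              return;
          }
          for (int i = 0; i < uLen; ++i)
              std::printf("U+%04X ", u[i]);     // dump instead of drawing
      }
  };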
Yes, I know that PDF was not originally designed to represent machine-readable text, and instead is essentially optimized for human-readability only. However, the fact is that in today's world some of us must extract text from PDFs anyway.

I stipulate that I am aware of the problem and that I am forced to face it anyway.
Hmm, OK.  I'm a little concerned, looking at the code, that assumptions
about how to map a character from a given font are made on a whole-font
basis, not per-character.
I'm not sure if there is support for fallback
mechanisms in the algorithms that convert a PDF character to Unicode for
pdftotext.  For example, if a document has font X and I know that
character A in that font should be remapped to Z, but I have no
information on some other character B, I want to be able to specify the
A->Z remapping without affecting whatever default is used to show the B
character.  I'm not sure if the code simply looks for the existence of a
certain kind of translation table for each font and then assumes that
the table is always complete - that would be sub-optimal for my
use-case.  Can someone shed light on that question?
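
Concretely, the lookup behavior I'm after would look like this sketch
(all names here are hypothetical, not poppler's):

  #include <map>

  using CharCode = unsigned;
  using Uni      = char32_t;

  // The user's patch is consulted first; anything it does not cover falls
  // through to whatever mapping the PDF (or a built-in default) provides.
  Uni mapChar(CharCode c,
              const std::map<CharCode, Uni> &userPatch,     // e.g. only A -> Z
              const std::map<CharCode, Uni> &defaultTable)  // still handles B
  {
      if (auto it = userPatch.find(c); it != userPatch.end())
          return it->second;                // explicit per-character override
      if (auto it = defaultTable.find(c); it != defaultTable.end())
          return it->second;                // untouched default behavior
      return 0xFFFD;                        // U+FFFD: no mapping known at all
  }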
The ToUnicode table is per-font. But, for example, normal, bold, italic,
and bold+italic are four different fonts; that is why the merge is
needed for HTML.

There should already be a place to hook the Unicode mapping table,
because there is already a place in the code (I've seen it once) that
extracts the font-specific Unicode mapping table from the PDF.


OK, here I will begin to delve into the code itself, at least to scope the problem. This is important to our team and we're willing to devote resources to solving the problem and contributing code if people are willing to answer questions about aspects of the existing Poppler code. Happy to take some questions offline if there are particular folks who have the necessary expertise.

I think my feature request could be broken down like this:

1. Pass a file containing a custom set of font-specific "ToUnicode" mappings to pdftotext (a possible file format is sketched below).
2. Ensure that pdftotext correctly parses and preserves, for each PDF character, the name of the font used (in the PDF) to represent it, as well as the character number.
3. Edit the appropriate code (in GfxFont.cc?) to "patch" the character-to-Unicode mapping using the table supplied in step 1. The patched mapping should probably be applied as if it came from a "ToUnicode" table supplied in the PDF.
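
For step 1, here is a minimal sketch of what I have in mind - the file
format and every name below are proposals, not existing poppler features.
Each line would give a font name, a character code, and the Unicode value
it should map to:

  #include <fstream>
  #include <map>
  #include <sstream>
  #include <string>

  // e.g. a line "MathematicalPi-One 0x31 0x0031" maps code 0x31 in that
  // font to U+0031 ("1").
  using PatchTable = std::map<std::string, std::map<unsigned, char32_t>>;

  PatchTable loadPatchTable(const std::string &path)
  {
      PatchTable table;
      std::ifstream in(path);
      std::string line;
      while (std::getline(in, line)) {
          if (line.empty() || line[0] == '#') continue;   // allow comments
          std::istringstream fields(line);
          std::string font;
          unsigned code, uni;
          if (fields >> font >> std::hex >> code >> uni)
              table[font][code] = static_cast<char32_t>(uni);
      }
      return table;
  }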

At line 1182 in GfxFont.cc, I see the comment "// merge differences into encoding" (preceding a code block) which seems to be where a "patch" table such as the one I'm proposing should be utilized. However, I have questions:

a. It looks like that code block might only be used for certain fonts (it's in Gfx8BitFont::Gfx8BitFont). Is that true? If so, is there an analogous block for each of the other possible font types?
b. Is there any danger of the "patches" being ignored for certain fonts based on the "choose a cmap" logic outlined later? In my copy there is a block of comments that begins:

  // To match up with the Adobe-defined behaviour, we choose a cmap
  // like this:
  // 1. If the PDF font has an encoding:
...
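
If that comment marks the right spot, step 3 might look something like
this sketch - applyUserPatch() and PatchTable are hypothetical, and only
the placement (immediately after the differences are merged) is the
actual proposal:

  #include <map>
  #include <string>

  // Same shape as the PatchTable sketched under step 1 above.
  using PatchTable = std::map<std::string, std::map<unsigned, char32_t>>;

  // Called right after "merge differences into encoding": overrides only
  // the codes the user listed for this font, leaving all others untouched.
  void applyUserPatch(const std::string &fontName,
                      const PatchTable &patches,
                      char32_t toUnicode[256])     // 8-bit font: 256 codes
  {
      auto fontIt = patches.find(fontName);
      if (fontIt == patches.end())
          return;                                  // no patch for this font
      for (const auto &[code, uni] : fontIt->second)
          if (code < 256)
              toUnicode[code] = uni;               // per-character override
  }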

Any takers for any of the above steps 1-3?

Finally: I think allowing users to specify this kind of character-mapping table manually at runtime would contribute enormously to the usefulness of pdftotext. At the moment, the character-mapping issue is the most prominent problem with using an otherwise very useful program (THANK YOU to all who have been building/maintaining this suite!), and it's frustrating to have the correct mappings already in hand but no way to tell pdftotext to incorporate them. If pdftotext *could* accept such a table, there would be a strong incentive to crowdsource additions to it, since they are of broad interest. Collecting the data for such a table is quite labor-intensive, so making the results easy to use would itself do a lot to motivate the work.

Thanks very much,
--Jeff

Regards.

N.B. PDFs might have attachments. I once came across a PDF without the
font encodings - but with the source WinWord document attached. Worth
checking.



