On 6/11/2013 3:34 PM, Ihar `Philips` Filipau wrote:
On 6/11/13, Jeff Lerman <[email protected]> wrote:
Yes, indicating words is an advantage - but failing to indicate which font a
given character in a word uses is a bug.
This is about the right time to tell you something: the focus of poppler is
the on-screen representation of the PDF, not helping to extract
information from PDFs. Otherwise, a year ago I would have flooded
the place with patches. :D
On the one hand, fair point - on the other hand, pdftotext is included in poppler, and as Albert has pointed out, things get better when they get worked on.
To paint a character, one doesn't need to know its Unicode value - the raw
code point is an index into the font's glyph data for the character. The
Unicode value of a character is only needed for copy-pasting. (Some PDF
software intentionally strips the Unicode mapping tables to make
copy-paste/text extraction unusable.)
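
To make the distinction concrete, here is a minimal sketch (hypothetical
types, not poppler's API) of the two mappings a font carries:

  #include <cstdint>
  #include <map>
  #include <optional>

  using CharCode = uint32_t;
  using GlyphId  = uint32_t;

  struct SimpleFont {
      std::map<CharCode, GlyphId>  glyphs;     // always needed: paints the page
      std::map<CharCode, char32_t> toUnicode;  // optional: only for extraction

      // Painting: the raw code point indexes straight into the glyph data.
      GlyphId glyphFor(CharCode c) const { return glyphs.at(c); }

      // Extraction: fails when the producer stripped the ToUnicode table,
      // even though rendering still works fine.
      std::optional<char32_t> unicodeFor(CharCode c) const {
          auto it = toUnicode.find(c);
          if (it == toUnicode.end()) return std::nullopt;
          return it->second;
      }
  };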

Otherwise, it is worth googling "pdf2htmlEX" and/or "pdftohtmlEX" - search
for the precise terms. There are several projects on the net (one of them
is definitely based on poppler) focused on extracting text/etc. from PDFs
into HTML with a high level of fidelity. That would probably be more
helpful to you than forcing poppler to do something it is not designed
to do.
I'm looking up those packages now, and have checked out pdf2htmlEX. However, HTML is an unnecessary intermediate for our purposes. Also, I don't see evidence (happy to be corrected here) that pdf2htmlEX contains a table of mappings for obscure fonts, or a way for users to specify their own mapping tables. Without that, it doesn't help me.
Now, when I use pdftohtml (I'll include the actual command below too), I
get a file that includes:
  .....
As you can see, the font "MathematicalPi-One" is not noted as being the
correct one for that numeral "1".  There is no way to find out the
actual fonts being used, on a per-character basis, for the text in the
PDF file.
That is what I meant by saying that pdftohtml erroneously merges some fonts.
But this is not per se a bug: conversion of PDF into HTML is at best an
approximate process, primarily optimized to display an average PDF in a
readable fashion.
I would say that "erroneously merges" is indeed a bug. (Unless the documentation specifies that behavior - in which case it might be a feature, though not a great one, I'd suggest.) If the mis-assignment of fonts also happens in pdftotext, and if it happens before conversion to Unicode, then my request will be blocked (even if I choose to try coding it myself), since the per-character font information is lost (discarded) by pdftotext. Does anyone have any information on whether that is the case?
... In fact, modulo the "Tagged PDF" feature, PDF is not designed to represent text per se. The most common PDF is just a container of vector graphics, and some of that graphics happens to draw text. Extraction of text literally works by intercepting a text-drawing operation and, instead of drawing the text, dumping it into a file/etc. @Leonard, please don't hit me. /me *cowers*. :D
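
And to illustrate "extraction is interception", a sketch modeled loosely
on poppler's OutputDev::drawChar callback - the signature below is
simplified and illustrative, not the exact poppler header:

  #include <cstdio>

  class DumpingTextDev /* would derive from poppler's OutputDev */ {
  public:
      // Instead of rasterizing the glyph, write out whatever Unicode the
      // font's mapping produced for this character code.
      void drawChar(double /*x*/, double /*y*/, unsigned /*code*/,
                    const unsigned *u, int uLen) {
          if (uLen == 0) {                      // no ToUnicode mapping at all
              std::printf("?");
              return;
          }
          for (int i = 0; i < uLen; ++i)
              std::printf("U+%04X ", u[i]);     // dump instead of drawing
      }
  };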
Yes, I know that PDF was not originally designed to represent machine-readable text, and instead is essentially optimized for human-readability only. However, the fact is that in today's world some of us must extract text from PDFs anyway.

I stipulate that I am aware of the problem and that I am forced to face it anyway.
Hmm, OK.  I'm a little concerned, looking at the code, that assumptions
about how to map a character from a given font are made on a whole-font
basis, not per-character.
I'm not sure if there is support for fallback
mechanisms in the algorithms that convert a PDF character to Unicode for
pdftotext.  For example, if a document has font X and I know that
character A in that font should be remapped to Z, but I have no
information on some other character B, I want to be able to specify the
A->Z remapping without affecting whatever default is used to show the B
character.  I'm not sure if the code simply looks for the existence of a
certain kind of translation table for each font and then assumes that
the table is always complete - that would be sub-optimal for my
use-case.  Can someone shed light on that question?
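
Concretely, the lookup behavior I'm after would look like this sketch
(all names here are hypothetical, not poppler's):

  #include <map>

  using CharCode = unsigned;
  using Uni      = char32_t;

  // The user's patch is consulted first; anything it does not cover falls
  // through to whatever mapping the PDF (or a built-in default) provides.
  Uni mapChar(CharCode c,
              const std::map<CharCode, Uni> &userPatch,     // e.g. only A -> Z
              const std::map<CharCode, Uni> &defaultTable)  // still handles B
  {
      if (auto it = userPatch.find(c); it != userPatch.end())
          return it->second;                // explicit per-character override
      if (auto it = defaultTable.find(c); it != defaultTable.end())
          return it->second;                // untouched default behavior
      return 0xFFFD;                        // U+FFFD: no mapping known at all
  }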
The ToUnicode table is per-font. But, for example, normal, bold, italic,
and bold+italic are four different fonts; that is why the merge is
needed for HTML.

There should already be a place to hook the Unicode mapping table,
because there is already a place in the code (I've seen it once) that
extracts the font-specific Unicode mapping table from the PDF.


OK, here I will begin to delve into the code itself, at least to scope the problem. This is important to our team and we're willing to devote resources to solving the problem and contributing code if people are willing to answer questions about aspects of the existing Poppler code. Happy to take some questions offline if there are particular folks who have the necessary expertise.

I think my feature request could be broken down like this:

1. Pass a file containing a custom set of font-specific "ToUnicode" mappings to pdftotext (a possible file format is sketched below).
2. Ensure that pdftotext correctly parses and preserves, for each PDF character, the name of the font used (in the PDF) to represent it, as well as the character number.
3. Edit the appropriate code (in GfxFont.cc?) to "patch" the character-to-Unicode mapping using the table supplied in step 1. The patched mapping should probably be applied as if it came from a "ToUnicode" table supplied in the PDF.
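
For step 1, here is a minimal sketch of what I have in mind - the file
format and every name below are proposals, not existing poppler features.
Each line would give a font name, a character code, and the Unicode value
it should map to:

  #include <fstream>
  #include <map>
  #include <sstream>
  #include <string>

  // e.g. a line "MathematicalPi-One 0x31 0x0031" maps code 0x31 in that
  // font to U+0031 ("1").
  using PatchTable = std::map<std::string, std::map<unsigned, char32_t>>;

  PatchTable loadPatchTable(const std::string &path)
  {
      PatchTable table;
      std::ifstream in(path);
      std::string line;
      while (std::getline(in, line)) {
          if (line.empty() || line[0] == '#') continue;   // allow comments
          std::istringstream fields(line);
          std::string font;
          unsigned code, uni;
          if (fields >> font >> std::hex >> code >> uni)
              table[font][code] = static_cast<char32_t>(uni);
      }
      return table;
  }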

At line 1182 in GfxFont.cc, I see the comment "// merge differences into encoding" (preceding a code block) which seems to be where a "patch" table such as the one I'm proposing should be utilized. However, I have questions:

a. It looks like that code block might only be used for certain fonts (it's in Gfx8BitFont::Gfx8BitFont). Is that true? If so, is there an analogous block for each of the other possible font types?
b. Is there any danger of the "patches" being ignored for certain fonts based on the "choose a cmap" logic outlined later? In my copy there is a block of comments that begins:

  // To match up with the Adobe-defined behaviour, we choose a cmap
  // like this:
  // 1. If the PDF font has an encoding:
...
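
If that comment marks the right spot, step 3 might look something like
this sketch - applyUserPatch() and PatchTable are hypothetical, and only
the placement (immediately after the differences are merged) is the
actual proposal:

  #include <map>
  #include <string>

  // Same shape as the PatchTable sketched under step 1 above.
  using PatchTable = std::map<std::string, std::map<unsigned, char32_t>>;

  // Called right after "merge differences into encoding": overrides only
  // the codes the user listed for this font, leaving all others untouched.
  void applyUserPatch(const std::string &fontName,
                      const PatchTable &patches,
                      char32_t toUnicode[256])     // 8-bit font: 256 codes
  {
      auto fontIt = patches.find(fontName);
      if (fontIt == patches.end())
          return;                                  // no patch for this font
      for (const auto &[code, uni] : fontIt->second)
          if (code < 256)
              toUnicode[code] = uni;               // per-character override
  }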

Any takers for any of the above steps 1-3?

Finally: I think allowing users to specify this kind of character-mapping table manually at runtime would contribute enormously to the usefulness of pdftotext. At the moment, the character-mapping issue is the most prominent problem with using an otherwise very useful program (THANK YOU to all who have been building/maintaining this suite!), and it's frustrating to have the correct mappings already in hand but no way to tell pdftotext to incorporate them. If pdftotext *could* accept such a table, there would be a strong incentive to crowdsource additions to it, since they are of broad interest. Collecting the data for such a table is quite labor-intensive, so making the results easy to use would itself do a lot to motivate the work.

Thanks very much,
--Jeff

Regards.

N.B. PDFs might have attachments. I once came across a PDF without the
font encodings - but with the source WinWord document attached. Worth
checking.



