On 6/11/2013 1:43 PM, Ihar `Philips` Filipau wrote:
On 6/11/13, Jeff Lerman <[email protected]> wrote:
Regarding #2 (pdftohtml solution): Currently, pdftohtml (version 0.22.4)
does a poor job of indicating which character in a PDF is in which
font.
I've been using pdftohtml almost exclusively with the "-xml" option.
There you have the font id. (But an additional hack was required, since
the font comparison is pretty lax and erroneously merges similar
fonts, even if one has a toUnicode table and the other doesn't.)

The font indicated seems to be more on a per-word basis.  Some
pdftohtml cleanup is definitely needed there.
In my case that was an advantage, since the output was later fed to
another program to recover some structure from the documents. Text
reassembly is easy in comparison and was required anyway, to reverse
the effects of, for example, text justification.
Yes, indicating words is an advantage - but failing to indicate which font each individual character in a word uses is a bug. An example might help. To extract the true information, I am using a trial version of PDFlib TET. I ask it for an XML representation of my PDF using --tetml wordplus, which shows each word AND information for each character in the word. Here is a snippet of what I get:

 <Word>
  <Text>3849110KbC</Text>
  <Box llx="482.25" lly="438.00" urx="533.75" ury="447.00">
   <Glyph font="F0" size="9" x="482.25" y="438.00" width="4.50">3</Glyph>
   <Glyph font="F0" size="9" x="486.75" y="438.00" width="4.50">8</Glyph>
   <Glyph font="F0" size="9" x="491.25" y="438.00" width="4.50">4</Glyph>
   <Glyph font="F0" size="9" x="495.75" y="438.00" width="4.50">9</Glyph>
   <Glyph font="F3" size="9" x="500.25" y="438.00" width="7.50">1</Glyph>
   <Glyph font="F0" size="9" x="507.75" y="438.00" width="4.50">1</Glyph>
   <Glyph font="F0" size="9" x="512.25" y="438.00" width="4.50">0</Glyph>
   <Glyph font="F0" size="9" x="516.75" y="438.00" width="6.50">K</Glyph>
   <Glyph font="F0" size="9" x="523.25" y="438.00" width="4.50">b</Glyph>
   <Glyph font="F0" size="9" x="527.75" y="438.00" width="6.00">C</Glyph>
  </Box>
 </Word>

Note that the first numeral "1" is in font "F3" but the other characters in that word are in font "F0". In this case, "F3" is the font "MathematicalPi-One". In that font, the character encoded as "1" actually has a glyph that looks like a plus sign. (Yuck - but I digress.)

The actual PDF, displayed by Acrobat Reader, shows this word as "3849+10KbC".
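For what it's worth, here is a rough Python sketch of how one can scan TETML wordplus output for words like this one, whose glyphs span more than one font (the file name is made up and the namespace handling is simplified; this is just an illustration, not part of TET):

  import xml.etree.ElementTree as ET

  def mixed_font_words(tetml_path):
      """Yield (text, fonts) for words whose glyphs span several fonts."""
      tree = ET.parse(tetml_path)
      for elem in tree.iter():
          # Strip the TETML namespace so we can match local tag names.
          elem.tag = elem.tag.split('}')[-1]
      for word in tree.iter('Word'):
          glyphs = word.findall('.//Glyph')
          fonts = {g.get('font') for g in glyphs}
          if len(fonts) > 1:
              yield ''.join(g.text or '' for g in glyphs), fonts

  for text, fonts in mixed_font_words('MYFILE.tetml'):
      print(text, sorted(fonts))   # e.g. 3849110KbC ['F0', 'F3']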

Now, when I use pdftohtml with the following command:

pdftohtml -xml -fontfullname -s -i MYFILE.pdf

I get a file that includes:

    <fontspec id="7" size="11" family="Times-Bold" color="#000000"/>
    <fontspec id="8" size="11" family="Times-Roman" color="#000000"/>
    .
    .
    .
<text top="532" left="68" width="756" height="12" font="8">males, 40 females, mean age 40616 years), 45 suffering from idiopathic chronic pancreatitis and 54 from acute recurrent pancreatitis.</text>
<text top="549" left="81" width="742" height="12" font="7"><b>Methods. </b>Each subject was screened for the 18 CFTR mutations: DF508, DI507, R1162X, 2183AA.G, 21303K, 3849110KbC.T,</text>
<text top="565" left="67" width="756" height="12" font="8">G542X, 1717-1G.A, R553X, Q552X, G85E, 71115G.A, 3132delTG, 278915G.A, W1282X, R117H, R347P, R352Q), which cover</text>

As you can see, the font "MathematicalPi-One" is not noted as the correct one for that numeral "1". There is no way to recover the actual fonts used, on a per-character basis, for the text in the PDF file. Of course, pdftotext -bbox provides no font info at all.

So, that's what I mean about pdftohtml being buggy - it provides an unreliable indication of which font was used for each character.

Unfortunately, I am not really a C++ programmer, so minor code edits and
rebuilds are within my skillset, but significant enhancements/rewrites
are not.
Bummer. Anyway, it seems I'm unable to find the precise branch I was
using for that work. I don't think I kept all of it inside git.

Looking at the code, I do recall that I was detecting custom encodings
by checking the GfxFont::getFontEncoding() property. But I definitely
remember there was more to it, and that I was tweaking poppler for
certain documents.

Hmm, OK. I'm a little concerned, looking at the code, that assumptions about how to map a character from a given font are made on a whole-font basis, not per-character. I'm not sure whether the algorithms that convert a PDF character to Unicode for pdftotext support any fallback mechanism.

For example, if a document has font X and I know that character A in that font should be remapped to Z, but I have no information on some other character B, I want to be able to specify the A->Z remapping without affecting whatever default is used to show the B character. If the code simply looks for the existence of a certain kind of translation table for each font and then assumes that the table is always complete, that would be sub-optimal for my use case. Can someone shed light on that question?
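To make the semantics concrete, here is a tiny sketch of the fallback behavior I'm after - plain Python with hypothetical names, not poppler code:

  # custom_map: my hand-built table, keyed by (font name, char code).
  custom_map = {
      ("MathematicalPi-One", 0x31): "+",  # '1' in this font draws a plus
  }

  def to_unicode(font_name, char_code, default_to_unicode):
      # Per-character fallback: only codes present in the custom table
      # are remapped; everything else keeps today's default behavior.
      return custom_map.get((font_name, char_code),
                            default_to_unicode(char_code))

  # The A->Z remapping must not disturb how B is handled:
  print(to_unicode("MathematicalPi-One", 0x31, chr))  # '+'  (remapped)
  print(to_unicode("MathematicalPi-One", 0x42, chr))  # 'B'  (default)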

If you have PDF examples where a single glyph is represented using
multiple character codes, that would be interesting to see - but would
not be a problem for a remapping algorithm (and I can imagine cases
where it would happen; in fact it essentially does happen already in
Unicode).  Many-to-one is easy.  One-to-many would obviously be
problematic - are you saying you've seen that too?  I thought that would
be impossible, assuming a font-aware algorithm.
Many-to-one exclusively. The PDFs were deleted and forgotten as soon
as I was done with them. In several cases I simply gave up, given what
a huge waste of time the activity was. At first I also suspected some
flavor of UTF-8, but IIRC in one of the documents the French 'ç' (the
word "façade" was used often) was represented with something like 3
characters - yet it is 2 bytes in UTF-8. Believe me, back then I tried
many, many ways of reverse-engineering the encoding of the font. I
have just found that I still have the "tesseract" OCR installed, along
with my scripts which tried to use the OCR to rebuild the Unicode map.
(And I see that it was fruitless: the OCR failed on fancy modern
punctuation and on bold/italic recognition.)

Note that accented characters can be represented in several valid ways in Unicode - sometimes as a single combined character, other times as separate representations of the base character and the accent. It might be that you were seeing 1 byte for the "c" and 2 bytes for the combining cedilla. Not the most concise representation, but totally valid.
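For illustration, in Python:

  import unicodedata

  precomposed = "\u00e7"   # ç as one code point: 2 bytes in UTF-8
  decomposed = "c\u0327"   # 'c' + COMBINING CEDILLA: 3 bytes in UTF-8
  print(len(precomposed.encode("utf-8")))                         # 2
  print(len(decomposed.encode("utf-8")))                          # 3
  print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True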


BTW, I found something else: pdftotext has the "-bbox" option.
Although, similarly to "pdftohtml -xml", it also requires "manual"
reassembly of the text later. With a very simple hack, one can also
add the font name, size and style to the "-bbox" output - they are
stored in the TextWord properties. I have that change (attached),
though it is for a very old version of poppler. It would probably help.
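To illustrate the reassembly, here is a crude Python sketch that groups the <word> elements of "pdftotext -bbox" output into lines by their y coordinate (attribute names as in current output, which may vary by version; justification and multi-column layouts need much more care than this):

  import xml.etree.ElementTree as ET

  def lines_from_bbox(path, y_tol=2.0):
      """Very crude line reassembly for 'pdftotext -bbox' output."""
      tree = ET.parse(path)
      for elem in tree.iter():          # drop the XHTML namespace
          elem.tag = elem.tag.split('}')[-1]
      for page in tree.iter('page'):
          words = [(float(w.get('yMin')), float(w.get('xMin')), w.text or '')
                   for w in page.iter('word')]
          words.sort()
          line, last_y = [], None
          for y, x, text in words:
              # Start a new line when the y coordinate jumps.
              if last_y is not None and abs(y - last_y) > y_tol:
                  yield ' '.join(t for _, t in sorted(line))
                  line = []
              line.append((x, text))
              last_y = y
          if line:
              yield ' '.join(t for _, t in sorted(line))

  for line in lines_from_bbox('MYFILE.html'):   # file name is made up
      print(line)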

Thanks! But: Does your hack show the font name etc for each character in each word, or just a value for the word? The former is what I'd need...


And good luck with your PDFs. You definitely need it.


P.S. Just for the sake of experiment: open a few of the PDFs without
encodings in a recent Acrobat Reader, "Select All", "Copy", switch to
a word processor and "Paste". If the text in the word processor looks
as expected, not garbled, then you have a rare "tagged PDF" on your
hands. Very unlikely, but worth a try. The "tags" allow extra
information, like formatted text, to be stored in the PDF, which
Acrobat can extract.
Right, thanks; I've done that. No go. Even if some of my PDFs are tagged, the vast majority are not - they come from a wide range of publishers and vintages.

Best,
--Jeff

N.B. PDFs might have attachments. I once came across a PDF without the
font encodings - but with the source WinWord document attached. Worth
checking.
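(Poppler's pdfdetach tool can list and extract such attachments:
"pdfdetach -list file.pdf", then "pdfdetach -saveall file.pdf".)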


On 6/11/2013 10:06 AM, Ihar `Philips` Filipau wrote:
Hi!

#1.
You can't build the global per-font table you envision. Embedded fonts
often include only the symbols a document actually uses (subsetting),
so embedded versions of the same font can and do differ from document
to document - and consequently the character codes differ too.

#2.
I worked on something similar a long time ago. What I did was modify
pdftohtml to print the characters of fonts without a Unicode mapping
as raw codes, in the XML/HTML notation &#<code>;. (I can't remember
right now what trick I used to differentiate the fonts.) Finally, I
semi-manually replaced the codes with the real characters.
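The final substitution step can then be as simple as this Python sketch (the code map below is made up; in practice you need one map per problem font):

  import re

  code_map = {0x31: '+'}   # raw character code -> real character

  def substitute_raw_codes(text):
      # Replace the &#<code>; escapes that the patched pdftohtml
      # emitted for glyphs lacking a toUnicode mapping.
      def repl(m):
          return code_map.get(int(m.group(1)), m.group(0))
      return re.sub(r'&#(\d+);', repl, text)

  print(substitute_raw_codes('3849&#49;10KbC'))   # -> 3849+10KbC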


If there is a stopgap method by which I could add such info to
Poppler source somewhere and then recompile (hard-coding the table),
please let me know - I'm fine with that for short-term use though I
think a runtime table would be much much more flexible and useful.
I will try to locate my sources.
That would at least give you hints where to plug the tables.
But due to #1, you shouldn't put too much trust in such automated conversions.

P.S. I have also seen the effect where a single character was, for
whatever reason, represented with *multiple* character codes. IOW,
with some documents a character code -> Unicode translation isn't
possible, as it would leave some garbage in the document.

On 6/11/13, Jeff Lerman <[email protected]> wrote:
Hi,

This is my first post to the list, and I apologize in advance for any
naivete revealed by my question.  However:

BACKGROUND:
I have a project for which my team is extracting text from a large
number (~100K) of PDF files from scientific publications.  These PDFs
come from a wide variety of sources.  They often use obscure-sounding
fonts for symbols, and those fonts do not seem to include toUnicode data
in the PDFs themselves.  The mapping in these fonts is not obvious and
needs to be determined on a case-by-case (often character-by-character
when the font info is unavailable online) basis.

I have been accumulating my own table of character mappings for those
fonts, focusing on characters of most interest to our team (certain
symbols).  I would like to be able to apply that table during
text-extraction by pdftotext, but I don't see any way to do that
currently.  Since complaints about obscure non-documented font/character
mappings are common online, application of such a table seems like
something that would be of potentially broad interest.

REQUEST:
Ideally, I'd like to be able to take a 3-column table (see below) that I
have built and supply it to pdftotext at runtime. The table would be
applied when a given character from a given font appears in a PDF, no
toUnicode table is supplied in the PDF, and the character does appear
in the supplied table. Characters missing from the table would continue
to be extracted the way pdftotext extracts them today - i.e., the table
should have no effect on them.

The table would simply be a tab-delimited 3-column file with:
1. fontname, e.g. AdvP4C4E74 or AdvPi1 or YMath-Pack-Four, but NOT
subset-tagged names like NJIBIE+YMath-Pack-Four (i.e., the subset
prefix should be stripped)
2. font character (could supply an actual character, or a hexadecimal
codepoint)
3. desired Unicode mapping (again - could be an actual character or a
codepoint)

Exact table format isn't a big deal, but the above info is all that
should be needed.
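For concreteness, a sketch in Python of how such a table could be loaded (the file layout and the one-character-versus-hex heuristic are only illustrative):

  def load_font_map(path):
      """Read 'font <TAB> source code <TAB> Unicode mapping' rows."""
      table = {}
      with open(path, encoding='utf-8') as f:
          for row in f:
              row = row.rstrip('\n')
              if not row or row.startswith('#'):
                  continue
              font, src, dst = row.split('\t')
              # A single character is taken literally; anything longer
              # is treated as a hexadecimal code point.
              code = ord(src) if len(src) == 1 else int(src, 16)
              uni = dst if len(dst) == 1 else chr(int(dst, 16))
              table[(font, code)] = uni
      return table

  # Example row:  MathematicalPi-One <TAB> 0x31 <TAB> 002B  (maps to '+')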

If there is *already* a way to do this in pdftotext, please let me
know.  If there is a stopgap method by which I could add such info to
Poppler source somewhere and then recompile (hard-coding the table),
please let me know - I'm fine with that for short-term use though I
think a runtime table would be much much more flexible and useful.

Thanks!
--Jeff Lerman
