[poppler] pdftohtml outputs hidden text

Piotr Findeisen Wed, 04 Nov 2009 02:38:45 -0800

Hi!

I have run into a problem: pdftohtml and pdftotext sometimes output
hidden text, even when the -hidden switch (in pdftohtml) is not used.
Example:


wget -c http://www2.ing.unipi.it/ew2002/proceedings/114.pdf && pdftotext 114.pdf - | grep 'Picture to be added here'

When you view http://www2.ing.unipi.it/ew2002/proceedings/114.pdf in
Kpdf or Acrobat, you can search for 'Picture to be added here'; the match
is on the first page, right under the "Typical BWA network layout." image.
However, the text is not actually displayed there.

"pdftohtml -xml -i -c -f 1 -l 1 -noframes 114.pdf" lists this text with the font
<fontspec id="16" size="13" family="Times" color="#0000ff"/>
but gives no indication that the text is not rendered on screen.
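
To make the point concrete, here is a minimal sketch in Python that pairs each <text> element with the <fontspec> it refers to and lists the attributes available on it. The sample XML is hypothetical: only the fontspec with id="16" is copied from the real output, while the coordinates, the second font, and the helper name describe_text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like `pdftohtml -xml` output.
# Only the fontspec with id="16" comes from the real run; the rest is made up.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1262" width="892">
<fontspec id="3" size="10" family="Times" color="#000000"/>
<fontspec id="16" size="13" family="Times" color="#0000ff"/>
<text top="300" left="100" width="200" height="15" font="3">Typical BWA network layout.</text>
<text top="320" left="100" width="180" height="15" font="16">Picture to be added here</text>
</page>
</pdf2xml>"""


def describe_text(xml_string):
    """Return (content, font color, attribute names) for every <text> element."""
    root = ET.fromstring(xml_string)
    # Index the font definitions by their id so <text font="..."> can be resolved.
    fonts = {f.get("id"): f.attrib for f in root.iter("fontspec")}
    rows = []
    for text in root.iter("text"):
        font = fonts.get(text.get("font"), {})
        content = "".join(text.itertext())  # also covers nested <b>/<i> markup
        rows.append((content, font.get("color"), sorted(text.attrib)))
    return rows


for content, color, attrs in describe_text(SAMPLE):
    print(content, color, attrs)
```

In this sketch the visible caption and the hidden placeholder carry the same set of attributes (position, size, and font reference), so nothing in the element itself distinguishes one from the other.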

Is there some special feature of PDF that causes some text not to be
displayed, or to be displayed with 0% opacity?
Is it possible to capture this metadata with pdftohtml, or more generally
with the poppler suite?

best regards,
Piotr


