Hi!

I have run into a problem: pdftohtml and pdftotext sometimes output hidden text, even when the -hidden switch (of pdftohtml) is not used. Example:

    wget -c http://www2.ing.unipi.it/ew2002/proceedings/114.pdf && pdftotext 114.pdf - | grep 'Picture to be added here'

When you view http://www2.ing.unipi.it/ew2002/proceedings/114.pdf in Kpdf or Acrobat, you can search for 'Picture to be added here'. The match is on the first page, right under the "Typical BWA network layout." image, but the text is not actually displayed there.

    pdftohtml -xml -i -c -f 1 -l 1 -noframes 114.pdf

lists this text with the font

    <fontspec id="16" size="13" family="Times" color="#0000ff"/>

but it gives no clue that the text is not rendered on screen.

Is this some special feature of PDF that causes some text to be not displayed, or displayed with 0% opacity? Is it possible to capture this metadata with pdftohtml, or with the poppler suite in general?

best regards,
Piotr
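
P.S. To make the question concrete, here is a minimal sketch of how one might pull the suspicious text and its font attributes out of the pdftohtml XML. The names find_text and SAMPLE_XML are hypothetical, and the sample is hand-written, not real output: it combines the fontspec line quoted above with an assumed <text> element, and the attribute layout (top, left, font) is my understanding of the XML format rather than something verified against this file. It only shows what the XML exposes; it cannot tell whether the text is actually visible, which is exactly the missing piece.

```python
import xml.etree.ElementTree as ET

# Hand-written sample imitating the assumed shape of `pdftohtml -xml` output.
# Only the <fontspec> line is taken from the real output quoted in this mail.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="16" size="13" family="Times" color="#0000ff"/>
<text top="400" left="100" width="180" height="15" font="16">Picture to be added here</text>
</page>
</pdf2xml>
"""


def find_text(xml_string, phrase):
    """Return every <text> element containing `phrase`, with font details."""
    root = ET.fromstring(xml_string)
    # Map font id -> attributes, so each hit can report its colour and size.
    fonts = {f.get("id"): f.attrib for f in root.iter("fontspec")}
    hits = []
    for page in root.iter("page"):
        for text in page.iter("text"):
            # itertext() also collects text nested inside <b>/<i> children.
            content = "".join(text.itertext())
            if phrase in content:
                font = fonts.get(text.get("font"), {})
                hits.append({
                    "page": page.get("number"),
                    "top": text.get("top"),
                    "left": text.get("left"),
                    "font": text.get("font"),
                    "color": font.get("color"),
                    "text": content,
                })
    return hits


if __name__ == "__main__":
    for hit in find_text(SAMPLE_XML, "Picture to be added here"):
        print(hit)
```

On a real file, the string passed to find_text would be the XML that pdftohtml produces for the page instead of SAMPLE_XML.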
_______________________________________________ poppler mailing list [email protected] http://lists.freedesktop.org/mailman/listinfo/poppler
