On Wednesday, 4 November 2009, Piotr Findeisen wrote:
> Hi!
>
> I ran across a problem: pdftohtml and pdftotext sometimes output
> hidden text, even when not using the -hidden switch (in pdftohtml).
> Example:
>
> wget -c http://www2.ing.unipi.it/ew2002/proceedings/114.pdf && pdftotext 114.pdf - | grep 'Picture to be added here'
>
> When you view http://www2.ing.unipi.it/ew2002/proceedings/114.pdf in
> Kpdf or Acrobat, you can search for 'Picture to be added here'; it's on
> the first page, right under the "Typical BWA network layout." image.
> But well, it's not really displayed there.
>
> "pdftohtml -xml -i -c -f 1 -l 1 -noframes 114.pdf" lists this text as
> <fontspec id="16" size="13" family="Times" color="#0000ff"/>
> but it gives no clue that the text is not printed on the screen.
>
> Is this some special feature of PDF that causes some text to be not
> displayed or displayed with 0% opacity?
From a quick look at the code, it seems the PDF sets up a clip path that lies outside the area where the text is drawn, so effectively nothing is rendered.

> Is it possible to capture this meta data with pdftohtml or generally
> with poppler suite?

It is, but you'll have to make the text tools take the clip areas into account, which is not an easy task.

Albert

>
> best regards,
> Piotr
_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
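
To illustrate what "taking the clip areas into account" could mean, here is a minimal Python sketch. It is not poppler code and does not use any poppler API: the `Rect` and `TextFragment` types, the `visible_text` helper, and all coordinates are hypothetical, invented for illustration only. It also assumes rectangular clip areas and treats any overlap as visible, whereas real PDF clip paths can be arbitrary shapes that cut text only partially.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle (hypothetical type, not a poppler class)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def intersects(self, other: "Rect") -> bool:
        # True only when the two rectangles share a region of non-zero area.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


@dataclass(frozen=True)
class TextFragment:
    """A piece of text plus the clip area in force when it was drawn."""
    text: str
    bbox: Rect
    clip: Optional[Rect] = None  # None means no clipping was active


def visible_text(fragments: List[TextFragment]) -> List[str]:
    """Keep only fragments whose bounding box overlaps their clip area."""
    return [f.text for f in fragments
            if f.clip is None or f.bbox.intersects(f.clip)]


# Made-up coordinates: the second fragment is drawn entirely outside its
# clip rectangle, so a viewer would show nothing for it.
fragments = [
    TextFragment("Typical BWA network layout.", Rect(100, 300, 300, 315)),
    TextFragment("Picture to be added here",
                 Rect(100, 320, 300, 335),
                 clip=Rect(0, 0, 50, 50)),
]

print(visible_text(fragments))
```

A text extractor that applied a filter along these lines would drop the clipped string instead of reporting it. The hard part, as noted above, is tracking the actual clip state for every piece of text.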
