Rather than going through the content stream looking for text operators (BT &
ET with a Tj, TJ, ' or " between them... sounds like a job for regex), you could
instead look at the page's resources.
If there are any font resources, one could reasonably conclude that the page
contains text. There are a couple of 'gotchas'. You can draw text using only
path operators... inefficient but legal. Further, a page isn't required to
actually use the font resources it lists. Someone could theoretically include a
font and never use it... Note that I've never examined a PDF with this
particular aberration.
If, on the other hand, the page contained only XObject Image resources, you
could feel reasonably safe that it was an image-bearing PDF.
if (!hasFonts(page) && hasOnlyImages(page)) {
    // it's an image-only PDF
}
Writing the font & image tests is going to require Intimate PDF Knowledge.
For example:
There are two kinds of XObject resources, Images & Forms. Images are bit-map
data, Forms are compartmentalized content streams of their own with their own
resources... similar to an EPS file. You should recursively test the Form
XObjects to see if they contain text or are just images too. Forms are often
used as wrappers for Images.
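Here's a rough sketch of that recursive test in Java. It doesn't touch iText at all: Resources and XObject are made-up stand-ins for the real resource dictionaries, and isImageOnly is just a hypothetical name wrapping the hasFonts/hasOnlyImages check, so treat this as the shape of the logic rather than working PDF code. With a real library you'd have to dig the /Font and /XObject dictionaries out of each page and each Form yourself.

```java
import java.util.Map;

public class ImageOnlyCheck {

    /** Hypothetical stand-in for a resource dictionary (not an iText class). */
    static final class Resources {
        final Map<String, Object> fonts;      // entries under /Font
        final Map<String, XObject> xObjects;  // entries under /XObject

        Resources(Map<String, Object> fonts, Map<String, XObject> xObjects) {
            this.fonts = fonts;
            this.xObjects = xObjects;
        }
    }

    /** Hypothetical stand-in for an XObject: an Image, or a Form with its own resources. */
    static final class XObject {
        final boolean isForm;       // true for /Subtype /Form, false for /Subtype /Image
        final Resources resources;  // a Form's own resources; null for Images

        private XObject(boolean isForm, Resources resources) {
            this.isForm = isForm;
            this.resources = resources;
        }

        static XObject image() {
            return new XObject(false, null);
        }

        static XObject form(Resources resources) {
            return new XObject(true, resources);
        }
    }

    /** True if these resources, or those of any nested Form, list a font. */
    static boolean hasFonts(Resources r) {
        if (r == null) {
            return false;
        }
        if (!r.fonts.isEmpty()) {
            return true;
        }
        for (XObject x : r.xObjects.values()) {
            if (x.isForm && hasFonts(x.resources)) {
                return true;
            }
        }
        return false;
    }

    /** True if there is at least one XObject and every Form bottoms out in Images. */
    static boolean hasOnlyImages(Resources r) {
        if (r == null || r.xObjects.isEmpty()) {
            return false; // nothing image-like here at all
        }
        for (XObject x : r.xObjects.values()) {
            if (x.isForm && !hasOnlyImages(x.resources)) {
                return false;
            }
        }
        return true;
    }

    /** The combined test: no fonts anywhere, and only Images once Forms are unwrapped. */
    static boolean isImageOnly(Resources pageResources) {
        return !hasFonts(pageResources) && hasOnlyImages(pageResources);
    }
}
```

A malformed file could contain a Form that refers back to itself, so a real implementation would want a depth limit or a visited set; the sketch leaves that out.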
If you go the "check the content stream" route, you'll still have to examine
XObject resource calls because an XObject Form can contain fonts/text, but is
drawn by the same content stream command as an XObject Image:
"
/<resource name> Do
"
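To make the content stream route a bit more concrete, here's a small Java sketch of a scanner that notes whether any text-showing operator (Tj, TJ, ' or ") turns up and collects the names handed to Do. The class and method names are invented for illustration. It assumes the content stream has already been decompressed and decoded into a String, and it doesn't cope with inline images (BI ... ID ... EI), whose binary data could confuse it. It skips over strings and comments properly, which is where a plain regex tends to go wrong.

```java
import java.util.ArrayList;
import java.util.List;

public class ContentStreamScan {

    /** What the scan found (hypothetical helper type). */
    static final class Result {
        boolean showsText = false;                            // saw Tj, TJ, ' or "
        final List<String> xObjectNames = new ArrayList<>();  // names passed to Do
    }

    private static boolean isDelimiter(char c) {
        return "()<>[]{}/%".indexOf(c) >= 0;
    }

    private static boolean isRegular(char c) {
        return !Character.isWhitespace(c) && !isDelimiter(c);
    }

    static Result scan(String content) {
        Result result = new Result();
        String lastName = null; // the /Name operand seen just before the current token
        int i = 0;
        final int n = content.length();
        while (i < n) {
            char c = content.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '%') {
                // Comment: skip to the end of the line.
                while (i < n && content.charAt(i) != '\n' && content.charAt(i) != '\r') {
                    i++;
                }
            } else if (c == '(') {
                // Literal string: parentheses may nest, backslash escapes one character.
                int depth = 0;
                do {
                    char s = content.charAt(i);
                    if (s == '\\') {
                        i++; // also skip the escaped character
                    } else if (s == '(') {
                        depth++;
                    } else if (s == ')') {
                        depth--;
                    }
                    i++;
                } while (i < n && depth > 0);
            } else if (c == '<') {
                if (i + 1 < n && content.charAt(i + 1) == '<') {
                    i += 2; // start of a dictionary, e.g. marked-content properties
                } else {
                    // Hex string: skip to the closing '>'.
                    while (i < n && content.charAt(i) != '>') {
                        i++;
                    }
                    i++;
                }
            } else if (c == '/') {
                // Name operand: remember it in case the next token is Do.
                int start = ++i;
                while (i < n && isRegular(content.charAt(i))) {
                    i++;
                }
                lastName = content.substring(start, i);
            } else if (isDelimiter(c)) {
                i++; // array brackets, closing '>' and the like
            } else {
                // A number or an operator.
                int start = i;
                while (i < n && isRegular(content.charAt(i))) {
                    i++;
                }
                String tok = content.substring(start, i);
                if (tok.equals("Tj") || tok.equals("TJ") || tok.equals("'") || tok.equals("\"")) {
                    result.showsText = true;
                } else if (tok.equals("Do") && lastName != null) {
                    result.xObjectNames.add(lastName);
                }
                lastName = null;
            }
        }
        return result;
    }
}
```

Each name in xObjectNames still has to be looked up in the XObject resources to see whether it's an Image or a Form, and a Form's own content stream needs the same treatment.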
Note that the 'font test' method wouldn't work if a scanning program added a
text header/footer, watermark, or the like to its output. If you wanted to
ignore footers, you'd have to figure out where all the text in a content stream
is being drawn and ignore certain regions. "Non-trivial" takes on a whole new
meaning.
--Mark Storer
Senior Software Engineer
Cardiff.com
#include <disclaimer>
typedef std::Disclaimer<Cardiff> DisCard;
_______________________________________________
iText-questions mailing list
https://lists.sourceforge.net/lists/listinfo/itext-questions