Rather than going through the content stream looking for text operators (BT & 
ET with a Tj, TJ, ' or " between them... sounds like a job for regex), you could 
instead look at the page's resources.
 
If there are any font resources, one could reasonably conclude that the page 
contains text.  There are a couple of 'gotchas'.  You can draw text using only 
path operators... inefficient but legal.  Further, a font listed in the 
resources doesn't have to be used.  Someone could theoretically include a font 
and never use it... Note that I've never examined a PDF with this particular 
aberration.
 
If, on the other hand, the page contained only XObject Image resources, you 
could feel reasonably safe that it was an image-only page.
 
if (!hasFonts(page) && hasOnlyImages(page)) {
  // no fonts and nothing but images: it's an image-only page
}
 
Writing the font & image tests is going to require Intimate PDF Knowledge.
 
For example:
 
There are two kinds of XObject resources, Images & Forms.  Images are bit-map 
data, Forms are compartmentalized content streams of their own with their own 
resources... similar to an EPS file.  You should recursively test the Form 
XObjects to see if they contain text or are just images too.  Forms are often 
used as wrappers for Images.
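
Here is a rough sketch of what the two tests might look like, including the recursion into Form XObjects.  It's an illustration under assumptions, not iText code: each PDF dictionary is modelled as a plain Map<String, Object> with indirect references already resolved and names stored without the leading slash, and ResourceCheck and its dict helper are made-up names.  With a real PDF library you'd walk its own dictionary types instead.

```java
import java.util.Map;

// Hypothetical stand-in model: a PDF dictionary is a Map<String, Object>,
// indirect references are already resolved, names carry no leading slash.
final class ResourceCheck {

    // Returns the sub-dictionary stored under key, or null if absent or not a dictionary.
    @SuppressWarnings("unchecked")
    private static Map<String, Object> dict(Map<String, Object> parent, String key) {
        Object value = (parent == null) ? null : parent.get(key);
        return (value instanceof Map) ? (Map<String, Object>) value : null;
    }

    // True if these resources, or the resources of any nested Form XObject, list a font.
    static boolean hasFonts(Map<String, Object> resources) {
        Map<String, Object> fonts = dict(resources, "Font");
        if (fonts != null && !fonts.isEmpty()) {
            return true;
        }
        Map<String, Object> xobjects = dict(resources, "XObject");
        if (xobjects == null) {
            return false;
        }
        for (String name : xobjects.keySet()) {
            Map<String, Object> xobj = dict(xobjects, name);
            if (xobj != null && "Form".equals(xobj.get("Subtype"))
                    && hasFonts(dict(xobj, "Resources"))) {
                return true;
            }
        }
        return false;
    }

    // True if there is at least one XObject and every XObject is either an Image
    // or a Form whose own resources pass the same test (a wrapper around images).
    static boolean hasOnlyImages(Map<String, Object> resources) {
        Map<String, Object> xobjects = dict(resources, "XObject");
        if (xobjects == null || xobjects.isEmpty()) {
            return false;
        }
        for (String name : xobjects.keySet()) {
            Map<String, Object> xobj = dict(xobjects, name);
            if (xobj == null) {
                return false;
            }
            Object subtype = xobj.get("Subtype");
            boolean ok = "Image".equals(subtype)
                    || ("Form".equals(subtype) && hasOnlyImages(dict(xobj, "Resources")));
            if (!ok) {
                return false;
            }
        }
        return true;
    }
}
```

You'd call both with the page's resource dictionary, as in the if statement above.  The sketch trusts that Forms don't refer back to themselves, so a damaged file with a cycle would recurse forever; a depth limit would be a sensible addition.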
 
If you go the "check the content stream" route, you'll still have to examine 
XObject resource calls because an XObject Form can contain fonts/text, but is 
drawn by the same content stream command as an XObject Image:
"
/<resource name> Do
"
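
If you do go that route, a small tokenizer is safer than a bare regex, because "Tj" or "Do" can just as easily appear inside a string or a comment.  Below is a minimal sketch, assuming the content stream has already been decompressed and turned into a String (one char per byte).  ContentScan and its fields are made-up names, and it makes no attempt to handle inline images (BI ... ID ... EI), whose binary data could confuse it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: scans a decoded content stream for text-showing
// operators and collects the resource names passed to the Do operator.
final class ContentScan {
    final List<String> invokedXObjects = new ArrayList<>();
    boolean showsText = false;

    private static boolean isDelimiterOrSpace(char c) {
        return c == 0 || Character.isWhitespace(c) || "()<>[]{}/%".indexOf(c) >= 0;
    }

    static ContentScan scan(String content) {
        ContentScan result = new ContentScan();
        String lastName = null;           // name operand seen just before the current token
        int i = 0;
        int n = content.length();
        while (i < n) {
            char c = content.charAt(i);
            if (c == '%') {               // comment: skip to end of line
                while (i < n && content.charAt(i) != '\n' && content.charAt(i) != '\r') i++;
            } else if (c == '(') {        // literal string: honor nesting and escapes
                int depth = 0;
                do {
                    char s = content.charAt(i);
                    if (s == '\\') i++;   // skip the escaped character
                    else if (s == '(') depth++;
                    else if (s == ')') depth--;
                    i++;
                } while (i < n && depth > 0);
            } else if (c == '<') {
                if (i + 1 < n && content.charAt(i + 1) == '<') {
                    i += 2;               // dictionary start, not a string
                } else {                  // hex string: skip to the closing '>'
                    while (i < n && content.charAt(i) != '>') i++;
                    i++;
                }
            } else if (c == '/') {        // name object
                int start = ++i;
                while (i < n && !isDelimiterOrSpace(content.charAt(i))) i++;
                lastName = content.substring(start, i);
            } else if (isDelimiterOrSpace(c)) {
                i++;
            } else {                      // number or operator
                int start = i;
                while (i < n && !isDelimiterOrSpace(content.charAt(i))) i++;
                String token = content.substring(start, i);
                if (token.equals("Tj") || token.equals("TJ")
                        || token.equals("'") || token.equals("\"")) {
                    result.showsText = true;
                } else if (token.equals("Do") && lastName != null) {
                    result.invokedXObjects.add(lastName);
                }
                lastName = null;          // Do takes the name directly before it
            }
        }
        return result;
    }
}
```

Each name in invokedXObjects still has to be looked up in the page's XObject resources to find out whether it is an Image or a Form, and any Form needs its own content stream scanned the same way.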
 
Note that the 'font test' method wouldn't work if a scanning program added a 
text header/footer, watermark, or the like to its output.  If you wanted to 
ignore footers, you'd have to figure out where all the text in a content stream 
is being drawn and ignore certain regions.  "Non-trivial" takes on a whole new 
meaning.
 
--Mark Storer 
  Senior Software Engineer 
  Cardiff.com

#include <disclaimer> 
typedef std::Disclaimer<Cardiff> DisCard; 

 