I've been working on some classes for extracting meaningful content from an existing PDF file (in my case, I am primarily interested in extracting text),
Great.
Any reason not to use existing tools such as JPedal, PDFBox or Multivalent? All of those have complete content extraction architectures.
The content is parsed by the PDFContentStreamTokenizer class, which breaks the stream up into either PDFContentOperator objects (which represent an operator in the content stream) or String objects (which represent operands on the rendering stack - i.e. the inputs that the operators are supposed to operate on).
This is a good novice approach to the problem, but you will find out quite quickly when you try to extend it that it doesn't scale :(
If you want to continue to build on your system, you will really want to look at breaking the stream down into "objects" - much like (or even exactly like) the ones used in the PDF structure (i.e. CosObjects) - so that operands arrive as typed numbers, names, strings and arrays rather than as plain Strings.
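To make that suggestion concrete, here is a minimal sketch in Java of what "breaking the stream down into objects" could look like. All class names (ContentObjects, CosObject, CosNumber, CosName, CosString, CosArray, Operator) are hypothetical and are not taken from iText, PDFBox or any other library. The sketch assumes the content stream has already been decoded into a String, and it only covers numbers, names, literal strings, arrays and operators; hex strings, dictionaries, inline images and comments are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; none of this is iText, PDFBox or JPedal API.
final class ContentObjects {
    interface CosObject {}
    static final class CosNumber implements CosObject {
        final double value;
        CosNumber(double value) { this.value = value; }
    }
    static final class CosName implements CosObject {
        final String name;
        CosName(String name) { this.name = name; }
    }
    static final class CosString implements CosObject {
        final String text;
        CosString(String text) { this.text = text; }
    }
    static final class CosArray implements CosObject {
        final List<CosObject> items = new ArrayList<>();
    }
    static final class Operator implements CosObject {
        final String keyword;
        Operator(String keyword) { this.keyword = keyword; }
    }

    private final String src;
    private int pos = 0;

    ContentObjects(String src) { this.src = src; }

    // Parses a whole (already decoded) content stream into typed objects.
    static List<CosObject> parse(String content) {
        ContentObjects p = new ContentObjects(content);
        List<CosObject> out = new ArrayList<>();
        for (CosObject o = p.next(); o != null; o = p.next()) out.add(o);
        return out;
    }

    // Returns the next object, or null at end of input.
    CosObject next() {
        skipWhitespace();
        if (pos >= src.length()) return null;
        char c = src.charAt(pos);
        if (c == '(') return readString();
        if (c == '[') return readArray();
        if (c == '/') { pos++; return new CosName(readWord()); }
        String word = readWord();
        // An empty word means we hit a delimiter this sketch does not handle.
        if (word.isEmpty()) throw new IllegalStateException("unsupported syntax at offset " + pos);
        if (word.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)")) return new CosNumber(Double.parseDouble(word));
        return new Operator(word); // anything else is treated as an operator keyword
    }

    private void skipWhitespace() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    // Reads up to the next whitespace or delimiter character.
    private String readWord() {
        int start = pos;
        while (pos < src.length() && !Character.isWhitespace(src.charAt(pos))
                && "()[]<>/".indexOf(src.charAt(pos)) < 0) pos++;
        return src.substring(start, pos);
    }

    // Literal string: balanced parentheses; a backslash keeps the next char as-is
    // (a simplification: \n, \r and octal escapes are not translated).
    private CosString readString() {
        StringBuilder sb = new StringBuilder();
        int depth = 1;
        pos++; // skip '('
        while (pos < src.length()) {
            char c = src.charAt(pos++);
            if (c == '\\' && pos < src.length()) { sb.append(src.charAt(pos++)); continue; }
            if (c == '(') depth++;
            else if (c == ')' && --depth == 0) break;
            sb.append(c);
        }
        return new CosString(sb.toString());
    }

    // Array: objects are parsed recursively until the closing ']'.
    private CosArray readArray() {
        CosArray arr = new CosArray();
        pos++; // skip '['
        while (true) {
            skipWhitespace();
            if (pos >= src.length()) throw new IllegalStateException("unterminated array");
            if (src.charAt(pos) == ']') { pos++; return arr; }
            arr.items.add(next());
        }
    }
}
```

With something like this, ContentObjects.parse("BT /F1 12 Tf [(Hel) -20 (lo)] TJ ET") yields the TJ operand as a single CosArray holding two strings and a number, so a text extractor can work on typed operands instead of re-parsing raw Strings for every operator.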
Leonard
--
Leonard Rosenthol <mailto:[EMAIL PROTECTED]>
Chief Technical Officer <http://www.pdfsages.com>
PDF Sages, Inc.
215-629-3700 (voice)
215-629-0789 (fax)