Re: [iText-questions] Preliminary work on PDF content extraction

Leonard Rosenthol Sat, 15 Nov 2003 16:22:47 -0800

At 9:48 AM -0700 11/13/03, Kevin Day wrote:

I've been working on some classes for extracting meaningful content from an
existing PDF file (in my case, I am primarily interested in extracting text),

Great.

Is there any reason not to use existing tools such as JPedal, PDFBox, or Multivalent? All of them already have complete content extraction architectures...

The content is parsed by the PDFContentStreamTokenizer class, which breaks
the stream up into either PDFContentOperator objects (which represent an
operator in the content stream) or String objects (which represent operands
on the rendering stack - i.e. inputs that the operators are supposed to
perform on).

This is a reasonable first approach to the problem, but you will find out quite quickly, once you try to extend it, that it doesn't scale :(.

If you want to continue to build on your system, you will really want to look at breaking the stream down into typed "objects", much like (or even exactly like) the ones used in the PDF structure itself (i.e. CosObjects).
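
To make the suggestion concrete, here is a minimal sketch in Java of a content stream parser that emits typed objects instead of raw Strings. Every class name in it (CosObject, CosNumber, CosName, CosString, CosArray, Operator, ContentStreamParser, TextExtractor) is hypothetical and is not taken from iText, PDFBox, JPedal, or Multivalent. It is deliberately incomplete: hex strings, dictionaries, inline images, comments, and proper escape decoding are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration only; not iText, PDFBox, or JPedal classes.
abstract class CosObject {}
final class CosNumber extends CosObject { final double value; CosNumber(double v) { value = v; } }
final class CosName extends CosObject { final String name; CosName(String n) { name = n; } }
final class CosString extends CosObject { final String text; CosString(String t) { text = t; } }
final class CosArray extends CosObject { final List<CosObject> items; CosArray(List<CosObject> i) { items = i; } }
final class Operator extends CosObject { final String keyword; Operator(String k) { keyword = k; } }

final class ContentStreamParser {
    private final String src;
    private int pos = 0;

    ContentStreamParser(String src) { this.src = src; }

    /** Parses the whole stream into a flat list of typed objects. */
    List<CosObject> parseAll() {
        List<CosObject> out = new ArrayList<>();
        for (skipWhitespace(); pos < src.length(); skipWhitespace()) {
            out.add(next());
        }
        return out;
    }

    // Reads one object starting at pos; the caller has already skipped whitespace.
    private CosObject next() {
        char c = src.charAt(pos);
        if (c == '[') { pos++; return parseArray(); }
        if (c == '(') { pos++; return parseString(); }
        if (c == '/') { pos++; return new CosName(readToken()); }
        String tok = readToken();
        if (tok.isEmpty()) throw new IllegalStateException("unexpected '" + c + "' at " + pos);
        if (tok.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)")) return new CosNumber(Double.parseDouble(tok));
        return new Operator(tok); // anything else is treated as an operator keyword
    }

    private CosArray parseArray() {
        List<CosObject> items = new ArrayList<>();
        for (skipWhitespace(); pos < src.length(); skipWhitespace()) {
            if (src.charAt(pos) == ']') { pos++; return new CosArray(items); }
            items.add(next());
        }
        throw new IllegalStateException("unterminated array");
    }

    // Literal string: balanced parentheses are allowed. Simplification: a backslash
    // just keeps the next character as-is (no \n or octal decoding).
    private CosString parseString() {
        StringBuilder sb = new StringBuilder();
        int depth = 1;
        while (pos < src.length()) {
            char c = src.charAt(pos++);
            if (c == '\\' && pos < src.length()) { sb.append(src.charAt(pos++)); continue; }
            if (c == '(') depth++;
            if (c == ')' && --depth == 0) return new CosString(sb.toString());
            sb.append(c);
        }
        throw new IllegalStateException("unterminated string");
    }

    // Reads characters up to the next whitespace or delimiter character.
    private String readToken() {
        int start = pos;
        while (pos < src.length() && !Character.isWhitespace(src.charAt(pos))
                && "()[]<>{}/%".indexOf(src.charAt(pos)) < 0) {
            pos++;
        }
        return src.substring(start, pos);
    }

    private void skipWhitespace() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }
}

final class TextExtractor {
    /** Collects the text shown by Tj and TJ by inspecting operand types. */
    static String extractText(List<CosObject> objects) {
        StringBuilder text = new StringBuilder();
        CosObject last = null; // most recent operand seen before the operator
        for (CosObject o : objects) {
            if (!(o instanceof Operator)) { last = o; continue; }
            String op = ((Operator) o).keyword;
            if (op.equals("Tj") && last instanceof CosString) {
                text.append(((CosString) last).text);
            } else if (op.equals("TJ") && last instanceof CosArray) {
                for (CosObject item : ((CosArray) last).items) {
                    if (item instanceof CosString) text.append(((CosString) item).text);
                }
            }
            last = null; // operands are consumed by the operator
        }
        return text.toString();
    }
}
```

With this shape, the operand of TJ in a stream such as "BT /F1 12 Tf [(Hel) -20 (lo)] TJ ET" arrives as a CosArray holding two CosString items and one CosNumber, so the operator handler can pick out the strings by type rather than re-parsing a String operand. That is the part that tends to get painful when every operand is a plain String.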


Leonard
--
---------------------------------------------------------------------------
Leonard Rosenthol                            <mailto:[EMAIL PROTECTED]>
Chief Technical Officer                      <http://www.pdfsages.com>
PDF Sages, Inc.                              215-629-3700 (voice)
                                             215-629-0789 (fax)


