Thank you. I'm curious as to why you didn't use JPEDAL, MULTIVALENT or another
existing library.
It's an interesting idea to have some text extraction capabilities.

Best Regards,
Paulo Soares

> -----Original Message-----
> From: Kevin Day [SMTP:[EMAIL PROTECTED]
> Sent: Thursday, November 13, 2003 16:48
> To:   [EMAIL PROTECTED]
> Subject:      [iText-questions] Preliminary work on PDF content extraction
> 
> I've been working on some classes for extracting meaningful content from an
> existing PDF file (in my case, I am primarily interested in extracting
> text), and I thought I'd share the current classes (in attached zip file).
> 
> The classes work by using iText to get at the content bytes of a particular
> page, then processing the content.
> 
> The content is parsed by the PDFContentStreamTokenizer class, which breaks
> the stream up into either PDFContentOperator objects (which represent an
> operator in the content stream) or String objects (which represent operands
> on the rendering stack, i.e. the inputs that the operators are supposed to
> operate on).
> 
> Sub-classes of PDFContentOperatorProcessor are then created to implement
> customized processing of the operator/operand groups.  Right now, there are
> processors for:
> 
> - RawTextExtractor:  Retrieving all of the content in a single string,
> unformatted
> - SimpleFormattedTextExtractor:  Retrieving all of the content in a single
> string with linebreaks in the appropriate places
> - PhraseTextExtractor:  Retrieving all of the content in a single string
> with each "phrase" (i.e. group of words that are put into the PDF in a
> single operation) on a separate line
> 
> The PDFContentStreamProcessor class is used to tie it all together and make
> it easy to use.  I have included a simple ProcessorExerciser class which
> shows how to use each of the above extractors.
> 
> 
> It should be pretty straightforward to create new extractors.  This could
> even be used as the foundation for rendering the PDF in a Java UI.
> 
> 
> As a last comment:  The SimpleFormattedTextExtractor is by no means 100%
> solid.  Detecting line breaks is a bit tough, and I'm sure that I haven't
> completely accounted for all of the coordinate transformations, etc. that
> can happen in a PDF content stream.  It does appear to work on all of the
> PDFs I've tested it with, though.
> 
> 
> I'd love to get some feedback on the architecture, and any ideas you all
> might have.
> 
> I hope it's OK to post a ZIP file...
> 
> Cheers,
> 
> Kevin
> 
> 
> 
> 
> 
> 
> -------------------------------------------------------
> This SF.Net email sponsored by: ApacheCon 2003,
> 16-19 November in Las Vegas. Learn firsthand the latest
> developments in Apache, PHP, Perl, XML, Java, MySQL,
> WebDAV, and more! http://www.apachecon.com/
> _______________________________________________
> iText-questions mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/itext-questions
> << File: itextextensions.zip >>
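
For readers without the attachment, the tokenizer/processor split described
in the quoted message can be sketched roughly as follows. This is not the
code from the zip file: the class ContentStreamSketch, the OperatorHandler
interface, and every method name below are hypothetical, and the sketch
assumes the page content has already been decoded into a plain string. It
only understands the Tj and TJ text-showing operators, does not decode
escape sequences or font encodings, and ignores hex strings, dictionaries,
and inline images.

```java
import java.util.ArrayList;
import java.util.List;

public class ContentStreamSketch {

    /** Hypothetical callback: receives an operator and the operands that preceded it. */
    public interface OperatorHandler {
        void handle(String operator, List<String> operands);
    }

    /** Splits a content stream into tokens; literal strings "(...)" are kept whole. */
    public static List<String> tokenize(String content) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < content.length()) {
            char c = content.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int start = i;
            if (c == '(') {
                int depth = 0;                      // parentheses may nest inside a string
                do {
                    char d = content.charAt(i);
                    if (d == '\\') i++;             // skip the escaped character as well
                    else if (d == '(') depth++;
                    else if (d == ')') depth--;
                    i++;
                } while (i < content.length() && depth > 0);
            } else if (c == '[' || c == ']') {
                i++;                                // array brackets are single-character tokens
            } else {
                while (i < content.length()
                        && !Character.isWhitespace(content.charAt(i))
                        && "()[]".indexOf(content.charAt(i)) < 0) {
                    i++;
                }
            }
            tokens.add(content.substring(start, Math.min(i, content.length())));
        }
        return tokens;
    }

    /** Rough classification: anything that is not a number, name, string, or bracket is an operator. */
    static boolean isOperand(String token) {
        char c = token.charAt(0);
        return Character.isDigit(c) || "(/[]<+-.".indexOf(c) >= 0;
    }

    /** Collects operands until an operator arrives, then hands the group to the handler. */
    public static void process(String content, OperatorHandler handler) {
        List<String> operands = new ArrayList<>();
        for (String token : tokenize(content)) {
            if (isOperand(token)) {
                operands.add(token);
            } else {
                handler.handle(token, operands);
                operands = new ArrayList<>();
            }
        }
    }

    /** A handler in the spirit of the raw extractor: concatenates all shown strings, unformatted. */
    public static String extractRawText(String content) {
        StringBuilder text = new StringBuilder();
        process(content, (operator, operands) -> {
            if (operator.equals("Tj") || operator.equals("TJ")) {
                for (String operand : operands) {
                    if (operand.length() >= 2 && operand.startsWith("(") && operand.endsWith(")")) {
                        text.append(operand, 1, operand.length() - 1); // strip the parentheses
                    }
                }
            }
        });
        return text.toString();
    }

    public static void main(String[] args) {
        String content = "BT /F1 12 Tf 72 712 Td (Hello) Tj [(Wor) -20 (ld)] TJ ET";
        System.out.println(extractRawText(content));
    }
}
```

Other extractors would be further handlers passed to process(). A
line-break-aware one would presumably also need to watch the positioning
operators (Td, Tm, T*), which is where the coordinate handling mentioned in
the message comes in.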

