Thank you. I'm curious as to why you didn't use JPEDAL, MULTIVALENT or another existing library. It's an interesting idea to have some text extraction capabilities.
Best Regards,
Paulo Soares

> -----Original Message-----
> From: Kevin Day [SMTP:[EMAIL PROTECTED]]
> Sent: Thursday, November 13, 2003 16:48
> To: [EMAIL PROTECTED]
> Subject: [iText-questions] Preliminary work on PDF content extraction
>
> I've been working on some classes for extracting meaningful content from
> an existing PDF file (in my case, I am primarily interested in extracting
> text), and I thought I'd share the current classes (in attached zip file).
>
> The classes work by using iText to get at the content bytes of a
> particular page, then processing the content.
>
> The content is parsed by the PDFContentStreamTokenizer class, which breaks
> the stream up into either PDFContentOperator objects (which represent an
> operator in the content stream) or String objects (which represent
> operands on the rendering stack - i.e. inputs that the operators are
> supposed to operate on).
>
> Sub-classes of PDFContentOperatorProcessor are then created to implement
> customized processing of the operator/operand groups. Right now, there
> are processors for:
>
> - RawTextExtractor: Retrieving all of the content in a single string,
>   unformatted
> - SimpleFormattedTextExtractor: Retrieving all of the content in a single
>   string with linebreaks in the appropriate places
> - PhraseTextExtractor: Retrieving all of the content in a single string
>   with each "phrase" (i.e. group of words that are put into the PDF in a
>   single operation) on a separate line
>
> The PDFContentStreamProcessor class is used to tie it all together and
> make it easy to use. I have included a simple ProcessorExerciser class
> which shows how to use each of the above extractors.
>
> It should be pretty straightforward to create new extractors. This could
> even be used as the foundation for rendering the PDF in a Java UI.
>
> As a last comment: The SimpleFormattedTextExtractor is by no means 100%
> solid.
> Detecting line breaks is a bit tough, and I'm sure that I haven't
> completely accounted for all of the coordinate transformations, etc...
> that can happen in a PDF content stream. It does appear to work on all
> of the PDFs I've tested it with, though.
>
> I'd love to get some feedback on the architecture, and any ideas you all
> might have.
>
> I hope it's OK to post a ZIP file...
>
> Cheers,
>
> Kevin
>
> -------------------------------------------------------
> This SF.Net email sponsored by: ApacheCon 2003,
> 16-19 November in Las Vegas. Learn firsthand the latest
> developments in Apache, PHP, Perl, XML, Java, MySQL,
> WebDAV, and more! http://www.apachecon.com/
> _______________________________________________
> iText-questions mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/itext-questions
> << File: itextextensions.zip >>
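
For readers without the attached zip, the tokenizer/processor split described in the quoted message might look roughly like the sketch below. This is not Kevin's code: the class ContentStreamSketch, its Token type and both methods are hypothetical names made up for illustration. It assumes the page content has already been obtained as a String, and it only handles literal strings shown with the Tj operator (no TJ arrays, hex strings, fonts or encodings).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from the attached classes.
public class ContentStreamSketch {

    enum Kind { OPERATOR, OPERAND, STRING }

    /** One token of the content stream: an operator, a plain operand, or a literal string. */
    static final class Token {
        final String text;
        final Kind kind;
        Token(String text, Kind kind) { this.text = text; this.kind = kind; }
    }

    /** Splits a content stream into operators and operands. */
    static List<Token> tokenize(String content) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < content.length()) {
            char c = content.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if (c == '(') {
                // Literal string: read to the matching ')', tracking nested parentheses.
                StringBuilder sb = new StringBuilder();
                int depth = 1;
                i++;
                while (i < content.length() && depth > 0) {
                    char s = content.charAt(i++);
                    if (s == '\\' && i < content.length()) {
                        // Simplification: the escaped character is kept verbatim.
                        sb.append(content.charAt(i++));
                    } else if (s == '(') {
                        depth++;
                        sb.append(s);
                    } else if (s == ')') {
                        depth--;
                        if (depth > 0) sb.append(s);
                    } else {
                        sb.append(s);
                    }
                }
                tokens.add(new Token(sb.toString(), Kind.STRING));
                continue;
            }
            // Bare token: runs until whitespace or the start of a literal string.
            int start = i;
            while (i < content.length()
                    && !Character.isWhitespace(content.charAt(i))
                    && content.charAt(i) != '(') {
                i++;
            }
            String word = content.substring(start, i);
            // Names (/F1) and numbers are operands; anything else is treated as an operator.
            boolean operand = word.startsWith("/")
                    || word.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)");
            tokens.add(new Token(word, operand ? Kind.OPERAND : Kind.OPERATOR));
        }
        return tokens;
    }

    /** Concatenates every string shown with Tj, unformatted. */
    static String extractRawText(String content) {
        StringBuilder out = new StringBuilder();
        List<Token> operands = new ArrayList<>();
        for (Token t : tokenize(content)) {
            if (t.kind != Kind.OPERATOR) {
                operands.add(t); // collect operands until the next operator arrives
                continue;
            }
            if (t.text.equals("Tj") && !operands.isEmpty()) {
                Token last = operands.get(operands.size() - 1);
                if (last.kind == Kind.STRING) out.append(last.text);
            }
            operands.clear(); // each operator consumes the operands before it
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractRawText("BT /F1 12 Tf (Hello) Tj ( World) Tj ET"));
    }
}
```

With iText, the input string would presumably come from the page's content bytes, decoded by the caller. Detecting line breaks, as in the formatted extractor, would additionally require tracking the text positioning operators, which this sketch ignores.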