RE: [iText-questions] Preliminary work on PDF content extraction

Paulo Soares Fri, 14 Nov 2003 08:41:25 -0800

Thank you. I'm curious as to why you didn't use JPEDAL, MULTIVALENT or another
existing library.
It's an interesting idea to have some text extraction capabilities in iText.


Best Regards,
Paulo Soares

> -----Original Message-----
> From: Kevin Day [SMTP:[EMAIL PROTECTED]
> Sent: Thursday, November 13, 2003 16:48
> To:   [EMAIL PROTECTED]
> Subject:      [iText-questions] Preliminary work on PDF content extraction
> 
> I've been working on some classes for extracting meaningful content from an
> existing PDF file (in my case, I am primarily interested in extracting
> text), and I thought I'd share the current classes (in the attached zip file).
> 
> The classes work by using iText to get at the content bytes of a particular
> page, then processing the content.
> 
> The content is parsed by the PDFContentStreamTokenizer class, which breaks
> the stream up into either PDFContentOperator objects (which represent an
> operator in the content stream) or String objects (which represent operands
> on the rendering stack, i.e. the inputs that the operators act on).
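
The tokenizing step described above can be illustrated with a minimal sketch. It is not the code from the attached zip file: OperatorSketch and TokenizerSketch are hypothetical names, and the classification rule (names, numbers and literal strings are operands, every other word is an operator) is an assumption made for illustration. Arrays, hex strings, dictionaries, inline images and comments are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the PDFContentOperator class named in the message. */
final class OperatorSketch {
    final String name;

    OperatorSketch(String name) {
        this.name = name;
    }
}

final class TokenizerSketch {

    /** Returns String operands and OperatorSketch objects in stream order. */
    static List<Object> tokenize(String content) {
        List<Object> tokens = new ArrayList<>();
        int n = content.length();
        int i = 0;
        while (i < n) {
            char c = content.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            int start = i;
            if (c == '(') {
                // Literal string: scan to the matching ')', honouring
                // nested parentheses and backslash escapes.
                int depth = 0;
                do {
                    char d = content.charAt(i);
                    if (d == '\\') {
                        i++; // skip the escaped character
                    } else if (d == '(') {
                        depth++;
                    } else if (d == ')') {
                        depth--;
                    }
                    i++;
                } while (i < n && depth > 0);
                tokens.add(content.substring(start, Math.min(i, n)));
                continue;
            }
            // Any other token runs until whitespace or the start of a string.
            while (i < n && !Character.isWhitespace(content.charAt(i))
                    && content.charAt(i) != '(') {
                i++;
            }
            String word = content.substring(start, i);
            if (looksLikeOperand(word)) {
                tokens.add(word);
            } else {
                tokens.add(new OperatorSketch(word));
            }
        }
        return tokens;
    }

    /** Assumed rule: names (/F1) and numbers are operands. */
    static boolean looksLikeOperand(String word) {
        char first = word.charAt(0);
        return first == '/' || first == '+' || first == '-' || first == '.'
                || Character.isDigit(first);
    }

    public static void main(String[] args) {
        for (Object t : tokenize("BT /F1 12 Tf (Hello) Tj ET")) {
            if (t instanceof OperatorSketch) {
                System.out.println("operator: " + ((OperatorSketch) t).name);
            } else {
                System.out.println("operand:  " + t);
            }
        }
    }
}
```

Keeping operands as plain strings mirrors the description in the message; a fuller tokenizer would probably want typed operands so that numbers and names do not have to be re-parsed by every consumer.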
> 
> Sub-classes of PDFContentOperatorProcessor are then created to implement
> customized processing of the operator/operand groups.  Right now, there are
> processors for:
> 
> - RawTextExtractor:  Retrieving all of the content in a single string,
> unformatted
> - SimpleFormattedTextExtractor:  Retrieving all of the content in a single
> string with linebreaks in the appropriate places
> - PhraseTextExtractor:  Retrieving all of the content in a single string
> with each "phrase" (i.e. group of words that are put into the PDF in a
> single operation) on a separate line
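
The three processors listed above can be pictured with a minimal sketch. Every class and method name here is hypothetical and the actual classes in the attached zip file may be organized quite differently. The sketch assumes that each operator arrives together with the operands that preceded it, handles only the Tj, T*, Td and TD text operators, and does not decode escape sequences inside strings. The line-break rule (any non-zero vertical offset starts a new line) is a crude stand-in for real coordinate tracking.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical base class: one callback per operator with its operands. */
abstract class OperatorProcessorSketch {
    protected final StringBuilder out = new StringBuilder();

    abstract void handle(String operator, List<String> operands);

    String result() {
        return out.toString();
    }

    /** Returns the last operand of Tj without its enclosing parentheses. */
    static String shownText(List<String> operands) {
        if (operands.isEmpty()) {
            return "";
        }
        String s = operands.get(operands.size() - 1);
        if (s.length() >= 2 && s.startsWith("(") && s.endsWith(")")) {
            return s.substring(1, s.length() - 1);
        }
        return s;
    }
}

/** All shown text in one unformatted string. */
final class RawTextSketch extends OperatorProcessorSketch {
    @Override
    void handle(String operator, List<String> operands) {
        if (operator.equals("Tj")) {
            out.append(shownText(operands));
        }
    }
}

/** Shown text with a line break wherever the text position moves vertically. */
final class LineBreakSketch extends OperatorProcessorSketch {
    @Override
    void handle(String operator, List<String> operands) {
        if (operator.equals("Tj")) {
            out.append(shownText(operands));
        } else if (operator.equals("T*")) {
            breakLine();
        } else if ((operator.equals("Td") || operator.equals("TD")) && operands.size() >= 2) {
            // The last operand is the vertical offset (ty).
            double ty = Double.parseDouble(operands.get(operands.size() - 1));
            if (ty != 0.0) {
                breakLine();
            }
        }
    }

    private void breakLine() {
        if (out.length() > 0) { // avoid a leading blank line
            out.append('\n');
        }
    }
}

/** Each show-text operation ("phrase") on its own line. */
final class PhraseSketch extends OperatorProcessorSketch {
    @Override
    void handle(String operator, List<String> operands) {
        if (operator.equals("Tj")) {
            out.append(shownText(operands)).append('\n');
        }
    }
}

final class ProcessorDemo {
    /** Feeds a small hand-written operator sequence to the given processor. */
    static String feed(OperatorProcessorSketch p) {
        p.handle("BT", Collections.<String>emptyList());
        p.handle("Tf", Arrays.asList("/F1", "12"));
        p.handle("Td", Arrays.asList("72", "700"));
        p.handle("Tj", Arrays.asList("(Hello)"));
        p.handle("Td", Arrays.asList("0", "-14"));
        p.handle("Tj", Arrays.asList("(World)"));
        p.handle("ET", Collections.<String>emptyList());
        return p.result();
    }

    public static void main(String[] args) {
        System.out.println(feed(new RawTextSketch()));
        System.out.println(feed(new LineBreakSketch()));
        System.out.print(feed(new PhraseSketch()));
    }
}
```

Because each extractor only overrides one callback, adding a new kind of extraction amounts to writing one more small subclass.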
> 
> The PDFContentStreamProcessor class is used to tie it all together and make
> it easy to use.  I have included a simple ProcessorExerciser class which
> shows how to use each of the above extractors.
> 
> 
> It should be pretty straightforward to create new extractors.  This could
> even be used as the foundation for rendering the PDF in a Java UI.
> 
> 
> As a last comment:  The SimpleFormattedTextExtractor is by no means 100%
> solid.  Detecting line breaks is a bit tough, and I'm sure that I haven't
> completely accounted for all of the coordinate transformations, etc. that
> can happen in a PDF content stream.  It does appear to work on all of the
> PDFs I've tested it with, though.
> 
> 
> I'd love to get some feedback on the architecture, and any ideas you all
> might have.
> 
> I hope it's OK to post a ZIP file...
> 
> Cheers,
> 
> Kevin
> 
> -------------------------------------------------------
> This SF.Net email sponsored by: ApacheCon 2003,
> 16-19 November in Las Vegas. Learn firsthand the latest
> developments in Apache, PHP, Perl, XML, Java, MySQL,
> WebDAV, and more! http://www.apachecon.com/
> _______________________________________________
> iText-questions mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/itext-questions
> << File: itextextensions.zip >>

