Peter - Thanks for getting in touch. I'm replying to the full iText mailing list so everyone can participate.
 
 
The code contributed to date has been focused on solving our own internal needs, but I am absolutely (and extremely) interested in any improvements/fixes/etc. to the text extraction algorithms.  There isn't a release with fixes planned at this point, but if improvements are available, and I can incorporate them without damaging existing code, then I see no reason they couldn't be added.
 
 
One severe limitation I have is that all of my work is done in the U.S., so my awareness of the various encodings is quite limited, and my ability to test them is even more limited.
 
What I would absolutely love to do is develop a suite of PDF source documents that demonstrate tricky aspects of the text extraction process, then evaluate those in a suite of unit tests.  I think by doing this we can ensure that changes to the algorithms don't adversely affect PDF content that is already properly handled.  These PDF files would need to be relatively short, and I'm thinking along the lines of having each PDF accompanied by a .txt file that contains the text that *should* be extracted from it.  The unit test can then do the extraction and compare the result to the .txt.
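A minimal sketch of what each such test could look like. Everything here is hypothetical scaffolding: the extraction step is stubbed with a hardcoded string, whereas a real test would call the actual extractor (e.g. PdfTextExtractor) on the sample PDF.

```java
public class ExtractionRegressionTest {
    // Normalize line endings and surrounding whitespace so the comparison
    // isn't sensitive to platform differences between the .pdf and .txt.
    static String normalize(String s) {
        return s.replace("\r\n", "\n").trim();
    }

    // Compare extracted text against the expected .txt contents.
    static boolean matchesExpected(String extracted, String expected) {
        return normalize(extracted).equals(normalize(expected));
    }

    public static void main(String[] args) {
        // Stub standing in for real extraction output from a sample PDF.
        String extracted = "Page 1 of 2\r\n";
        String expected  = "Page 1 of 2\n";   // contents of the paired .txt
        System.out.println(matchesExpected(extracted, expected)); // prints true
    }
}
```

The normalization step matters: without it, trivial CR/LF differences between platforms would make every test fail spuriously.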
 
If anyone else on the list has suggestions for improving the above testing strategy, or recommendations on alternatives, please let me know.
 
 
 
So, in answer to your umlaut issue, I think the place to start is to send me a simple PDF and .txt file that demonstrate the issue, plus a patch that demonstrates the fix.  I'll build a unit test around that PDF/.txt combination.
 
Then, once we have a test in place showing that the algorithm does indeed fail to handle the extraction properly, I'll apply your patch, and we'll confirm that it fixes the problem (without breaking extraction of text from other PDF files!).
 
 
 
I would also like more information on the 'no consideration of numbers in a Tj operation' issue - if you can shoot me a description of what's going on, a sample file, or even the patch showing how you addressed the limitation, I'd like to understand it.  To my knowledge, the current algorithm is pretty much ignorant of exactly which characters are involved in space determination - it simply decides whether there is a space by comparing the end X position of glyph N with the start X position of glyph N+1.
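That comparison can be sketched as follows. The half-of-space-width threshold is an illustrative assumption for this sketch, not necessarily what the current code uses:

```java
public class SpaceHeuristic {
    // Decide whether a space belongs between two adjacent glyphs by
    // comparing the horizontal gap to a fraction of the font's space
    // width (the 1/2 factor here is an assumed threshold).
    static boolean spaceBetween(float endXPrev, float startXNext, float spaceWidth) {
        return (startXNext - endXPrev) > spaceWidth / 2f;
    }

    public static void main(String[] args) {
        System.out.println(spaceBetween(100f, 104f, 5f)); // gap 4.0 > 2.5 -> true
        System.out.println(spaceBetween(100f, 101f, 5f)); // gap 1.0 < 2.5 -> false
    }
}
```

Note that this depends on knowing the space width for the current font, which is part of why text state has to be tracked carefully during extraction.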
 
 
 
 
Handling of the Do operator for inclusion of XObject content - this is *definitely* something that is needed.  Note that it is quite conceivable (and even highly likely) that the current Simple algorithm would cause text brought in by an XObject to wind up out of natural reading order.  The prototypical example is the text 'Page X of Y': the PDF might add 'Page X of' to each page, then use an XObject to include the 'Y'.  If the Do operator doesn't come immediately after the Tj operator that places 'Page X of', then the Simple algorithm will never be able to produce the text 'Page X of Y' - instead, the string 'Y' will show up at some other place, wherever the Do operation occurred.
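A rough sketch of how Do dispatch could work: when the processor meets a Do operator, it looks the name up in the page's XObject resources and recursively processes that form's content stream. All types here are simplified stand-ins for the real parser and iText classes, not their actual APIs.

```java
import java.util.*;

public class DoOperatorSketch {
    // Simplified stand-in for a parsed content stream: each entry is either
    // text placed by Tj, or "Do:<name>" referencing a form XObject.
    static void process(List<String> ops, Map<String, List<String>> xobjects,
                        StringBuilder out) {
        for (String op : ops) {
            if (op.startsWith("Do:")) {
                // Recurse into the form XObject's own content stream.
                List<String> form = xobjects.get(op.substring(3));
                if (form != null) process(form, xobjects, out);
            } else {
                out.append(op);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> xobjects = new HashMap<>();
        xobjects.put("Fm0", List.of("2"));              // the XObject draws the 'Y'
        List<String> page = List.of("Page 1 of ", "Do:Fm0");
        StringBuilder out = new StringBuilder();
        process(page, xobjects, out);
        System.out.println(out); // prints "Page 1 of 2"
    }
}
```

In the real library the recursion would also have to save and restore graphics/text state around the form, and apply the form's transformation matrix, which this sketch ignores.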
 
 
 
 
The Simple algorithm itself is, as you've found, quite simple and is absolutely guaranteed to fail in a number of situations (thus the name 'Simple') - it works for our needs, but it is far from a fully robust solution.
 
A better approach would be to process the entire content stream for a page and build a spatially aware table representing all of the text on the page, *then* do the processing required to determine which phrases follow each other, where paragraph breaks should go, etc.  What makes this a little tricky is that you have to capture a lot of text state for each location on the page (including font state), because you need that information to correctly determine inter-word spaces.
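The idea above can be sketched as follows: capture each piece of placed text as a positioned chunk, then sort the chunks into reading order before reassembling. The Chunk type and sort criteria are assumptions for illustration; a full implementation would also carry font metrics and spacing state.

```java
import java.util.*;

public class SpatialTextSketch {
    // A captured piece of text plus the minimal state needed to order it.
    static class Chunk {
        final String text; final float x, y;
        Chunk(String text, float x, float y) { this.text = text; this.x = x; this.y = y; }
    }

    // Order chunks top-to-bottom (PDF y grows upward), then left-to-right.
    static List<Chunk> sortForReading(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                              .thenComparingDouble(c -> c.x));
        return sorted;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = List.of(
            new Chunk("world", 60f, 700f),
            new Chunk("Y", 10f, 20f),          // placed late, e.g. via Do/XObject
            new Chunk("Hello ", 10f, 700f),
            new Chunk("Page X of ", 2f, 20f));
        StringBuilder sb = new StringBuilder();
        for (Chunk c : sortForReading(chunks)) sb.append(c.text);
        System.out.println(sb); // prints "Hello worldPage X of Y"
    }
}
```

Note how this approach reunites 'Page X of ' with its 'Y' even though the 'Y' was placed at a completely different point in the content stream - exactly the case the Simple algorithm gets wrong.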
 
 
 
Sorry for the long response - but the short answer is:  By all means, shoot me your patch and I'll take a look at it!
 
- K
 
----------------------- Original Message -----------------------

I'm evaluating iText for use in a PDF-handling Java
application. One feature we need is the text extraction
capability, but I ran into some problems trying
PdfTextExtractor. These problems are:

- incorrect encoding of German umlauts
- no consideration of numbers in the Tj operator when
recognizing spaces between words
- no support for the Do operator for handling form XObjects

Is there a release date for the next update?
Are there plans to solve these problems?

I experimented a little and could solve the first two
problems by modifying PdfContentStreamProcessor and
SimpleTextExtractingPdfContentStreamProcessor. Are you
interested in these changes? Maybe you can review and
integrate them in the next update of iText.

Yours sincerely,
Peter Zorn


_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions
