Re: [iText-questions] Initial code for text extraction from PDF using iText

jmrrva Mon, 03 Nov 2008 12:32:57 -0800

Hi Kevin,

Would it be suitable for PDF to XML conversion? I have hundreds of
tagged PDFs (PDF/A-1a) and I would like to convert them into XML,
using the PDF tags as XML elements and keeping the reading order of
the PDF/A-1a files.


Right now, the only tool I know of that does this is the Professional
version of Adobe Acrobat.

On Mon, Nov 3, 2008 at 9:17 PM, Kevin Day <[EMAIL PROTECTED]> wrote:
> Hi all-
>
> I've put together a first cut for text extraction from PDF content streams
> (attached - hopefully the list will allow zip files to be sent).  I'd
> welcome feedback and/or suggestions on next steps.
>
> This is still a bit rough, but I think it is a good start.  Use the 
> PdfContentReaderTool class as your main() class, with a single command line 
> argument holding the path of the PDF file you wish to process.  This will 
> output information about each page content, including the extracted text.
>
>
>
> Here are my comments:
>
> 1.  Instead of parsing the cmap itself, I used an external open source 
> library (FontBox-0.1.0-dev.jar) - this is easy enough to re-write, but it 
> didn't seem like that rewrite was adding a lot of value, so I took the 
> shortcut.  If the direction I'm headed is appealing, I can re-write the cmap 
> portion of the code (it's already mostly coded in the DocumentFont class - 
> just need to take the information and put it into a usable cmap, plus ensure 
> that the information gets processed for any font that has a ToUnicode entry). 
>  You will need to have the FontBox jar on your classpath for things to 
> compile and run.
>
> 2.  I think that it would be better to have the ToUnicode dictionary 
> processed inside the font classes themselves (probably in DocumentFont).  I 
> had to force cmaps into the font classes using a sub-class - not pretty, but 
> functional.
>
> 3.  I have not performed any sort of exhaustive test with different types of
> fonts, etc...  I have tested with PDF files containing the ToUnicode
> dictionary entry on their fonts.  My guess is that things will break down
> rather quickly for fonts that do *not* have a ToUnicode entry and whose
> UNC -> glyph mapping isn't ANSI.  Addressing this would definitely
> require adjustments to the DocumentFont class.
>
> 4.  The use of the Tj operator (lower case 'j') in a source PDF may result in
> incorrect inter-word space prediction.  I know how to fix this, but didn't
> want to take the time without feedback.
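
To make comment 1 concrete: a ToUnicode CMap is a small text program, and
its simplest entries (bfchar) pair a character code with a UTF-16BE value.
The following is only a minimal sketch of reading those entries without
FontBox. The class and method names are hypothetical, they are not part of
iText or of the attached code, and bfrange entries are deliberately ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText or FontBox.
class ToUnicodeSketch {
    // One bfchar entry, e.g. "<0003> <0020>": source code, then UTF-16BE text.
    private static final Pattern BFCHAR_ENTRY =
            Pattern.compile("<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>");

    /** Collects code -> text mappings from every beginbfchar/endbfchar section. */
    static Map<Integer, String> parseBfChars(String cmap) {
        Map<Integer, String> result = new LinkedHashMap<>();
        int start = cmap.indexOf("beginbfchar");
        while (start >= 0) {
            int end = cmap.indexOf("endbfchar", start);
            if (end < 0) {
                break; // unterminated section: stop rather than guess
            }
            Matcher m = BFCHAR_ENTRY.matcher(cmap.substring(start, end));
            while (m.find()) {
                int code = Integer.parseInt(m.group(1), 16);
                result.put(code, decodeUtf16Hex(m.group(2)));
            }
            start = cmap.indexOf("beginbfchar", end);
        }
        return result;
    }

    /** Turns hex such as "00660069" into the UTF-16 string "fi". */
    private static String decodeUtf16Hex(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            sb.append((char) Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return sb.toString();
    }

    /** Maps character codes to text; unmapped codes become U+FFFD. */
    static String decode(int[] codes, Map<Integer, String> map) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String text = map.get(code);
            sb.append(text != null ? text : "\uFFFD");
        }
        return sb.toString();
    }
}
```

A real implementation would also need bfrange sections and the codespace
ranges that determine how many bytes make up one character code.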
>
>
>
> Right now the text extraction is pretty simple - it handles inter-word space
> prediction, which is one of the harder things to deal with in PDF content
> streams, but it doesn't do full spatial analysis, etc... to determine the
> actual layout of words relative to each other, or perform spatial filtering,
> etc...  There is nothing to prevent this from being added, though - the
> PdfContentStreamProcessor#displayText() callback provides enough state to
> allow this to happen.
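
For readers unfamiliar with inter-word space prediction: PDF content
streams often position words without emitting a space character, so an
extractor has to infer spaces from the horizontal gap between consecutive
text runs. The sketch below shows one common heuristic and is not
necessarily what the attached code does. The Chunk type, the join() method
and the "half a space width" threshold are all assumptions made for
illustration.

```java
import java.util.List;

// Hypothetical illustration, not the attached implementation.
class SpacePredictionSketch {
    /** Text shown by one text-showing operator, with its horizontal extent. */
    static final class Chunk {
        final String text;
        final float startX;
        final float endX;

        Chunk(String text, float startX, float endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    /**
     * Joins chunks that sit on the same line, inserting a space whenever the
     * gap to the previous chunk exceeds half the width of a space glyph.
     * The 0.5 factor is an assumed threshold, not a value from the PDF spec.
     */
    static String join(List<Chunk> chunks, float spaceWidth) {
        StringBuilder out = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : chunks) {
            if (prev != null) {
                float gap = c.startX - prev.endX;
                boolean alreadySpaced =
                        prev.text.endsWith(" ") || c.text.startsWith(" ");
                if (gap > spaceWidth * 0.5f && !alreadySpaced) {
                    out.append(' ');
                }
            }
            out.append(c.text);
            prev = c;
        }
        return out.toString();
    }
}
```

Full spatial analysis would additionally compare baselines and sort chunks
by position before joining them; this sketch assumes the chunks already
arrive in reading order on a single line.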
>
>
> Please let me know if you have any feedback or suggestions!
>
> Cheers,
>
> - K
> -------------------------------------------------------------------------
> This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
> Build the coolest Linux based applications with Moblin SDK & win great prizes
> Grand prize is a trip for two to an Open Source event anywhere in the world
> http://moblin-contest.org/redirect.php?banner_id=100&url=/
> _______________________________________________
> iText-questions mailing list
> iText-questions@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/itext-questions
>
> Buy the iText book: http://www.1t3xt.com/docs/book.php
>

