Re: [iText-questions] Initial code for text extraction from PDF using iText

Kevin Day Mon, 03 Nov 2008 12:59:28 -0800

The text extraction I am working on is content stream text extraction (obtaining Unicode text from the text that renders on-screen). If you are talking about extracting the embedded XML tags from PDF files, this is not a solution for you.

- K

----------------------- Original Message -----------------------

From: jmrrva <[EMAIL PROTECTED]>

To: "Post all your questions about iText here" <itext-questions@lists.sourceforge.net>

Cc:

Date: Mon, 3 Nov 2008 21:32:38 +0100

Subject: Re: [iText-questions] Initial code for text extraction from PDF using iText

Hi Kevin,

Would it be suitable for PDF to XML conversion? I have hundreds of
tagged PDFs (PDF/A-1a) and I would like to convert them into XML,
using the PDF tags as XML elements and keeping the reading order of the
PDF/A-1a files.

Right now the only tool I know of that does this is the Professional
version of Adobe's Acrobat.

On Mon, Nov 3, 2008 at 9:17 PM, Kevin Day <[EMAIL PROTECTED]> wrote:
> Hi all-
>
> I've put together a first cut for text extraction from PDF content streams (attached - hopefully the list will allow zip files to be sent). Hopefully I can get some feedback and/or suggestions on next steps.
>
> This is still a bit rough, but I think it is a good start. Use the PdfContentReaderTool class as your main() class, with a single command line argument holding the path of the PDF file you wish to process. This will output information about each page content, including the extracted text.
>
>
>
> Here are my comments:
>
> 1. Instead of parsing the cmap itself, I used an external open source library (FontBox-0.1.0-dev.jar) - this is easy enough to re-write, but it didn't seem like that rewrite was adding a lot of value, so I took the shortcut. If the direction I'm headed is appealing, I can re-write the cmap portion of the code (it's already mostly coded in the DocumentFont class - just need to take the information and put it into a usable cmap, plus ensure that the information gets processed for any font that has a ToUnicode entry). You will need to have the FontBox jar on your classpath for things to compile and run.
>
> 2. I think that it would be better to have the ToUnicode dictionary processed inside the font classes themselves (probably in DocumentFont). I had to force cmaps into the font classes using a subclass - not pretty, but functional.
>
> 3. I have not performed any sort of exhaustive test with different types of fonts, etc... I have tested with PDF files containing the ToUnicode dictionary entry on their fonts. My guess is that things will break down rather quickly for fonts that do *not* have a ToUnicode entry and that have a UNC -> glyph mapping that isn't ANSI. Addressing this would definitely require adjustments to the DocumentFont class.
>
> 4. The use of the Tj operator (lower case 'j') in a source PDF may result in incorrect inter-word space prediction. I know how to fix this, but didn't want to take the time without feedback.
>
>
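
To illustrate the ToUnicode idea from points 1-3, here is a minimal sketch of turning the bfchar sections of a ToUnicode cmap into a lookup table. It is not the attached code and it does not use FontBox or DocumentFont; the class and method names (ToUnicodeSketch, parseBfChars, decode) are hypothetical. It assumes source codes fit in an int, that destination values are UTF-16BE written as groups of four hex digits, and it ignores bfrange and codespace sections entirely.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class ToUnicodeSketch {
    // One "<srcCode> <dstUnicode>" pair as written between beginbfchar and endbfchar.
    private static final Pattern PAIR =
            Pattern.compile("<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>");

    // Collects every bfchar mapping in the cmap text; bfrange sections are skipped.
    static Map<Integer, String> parseBfChars(String cmap) {
        Map<Integer, String> map = new LinkedHashMap<>();
        int pos = 0;
        while (true) {
            int start = cmap.indexOf("beginbfchar", pos);
            if (start < 0) break;
            int bodyStart = start + "beginbfchar".length();
            int end = cmap.indexOf("endbfchar", bodyStart);
            if (end < 0) break;
            Matcher m = PAIR.matcher(cmap.substring(bodyStart, end));
            while (m.find()) {
                map.put(Integer.parseInt(m.group(1), 16), decodeUtf16Be(m.group(2)));
            }
            pos = end + "endbfchar".length();
        }
        return map;
    }

    // Assumption: the destination is UTF-16BE, four hex digits per code unit.
    static String decodeUtf16Be(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            sb.append((char) Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return sb.toString();
    }

    // Maps character codes to text; unmapped codes become U+FFFD.
    static String decode(int[] codes, Map<Integer, String> map) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String s = map.get(code);
            sb.append(s != null ? s : "\uFFFD");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String cmap = "2 beginbfchar\n<01> <0048>\n<02> <00660069>\nendbfchar";
        System.out.println(decode(new int[] {1, 2}, parseBfChars(cmap)));
    }
}
```

Note that one code can map to several characters (a ligature glyph mapping to "fi", for example), which is why the table holds strings rather than single chars.
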
>
> Right now the text extraction is pretty simple - it handles inter-word space prediction, which is one of the harder things to deal with in PDF content streams, but it doesn't do full spatial analysis, etc... to determine the actual layout of words relative to each other, or perform spatial filtering, etc... There is nothing to prevent this from being added, though - the PdfContentStreamProcessor#displayText() callback provides enough state to allow this to happen.
>
>
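
One common way to think about inter-word space prediction is as a gap test: if the horizontal distance between the end of one text run and the start of the next exceeds some fraction of the font's space width, treat it as a word break. The sketch below shows only that general idea; it is not the algorithm in the attached code. The Chunk type, its fields, and the threshold parameter are hypothetical, and it assumes a single left-to-right baseline.

```java
import java.util.List;

// Hypothetical helper, for illustration only.
class SpacePredictionSketch {
    // A run of text with its horizontal extent and the width of a space
    // in the same units (assumed to be supplied by the content stream parser).
    static final class Chunk {
        final String text;
        final float startX;
        final float endX;
        final float spaceWidth;

        Chunk(String text, float startX, float endX, float spaceWidth) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.spaceWidth = spaceWidth;
        }
    }

    // Joins chunks in order, inserting a space when the gap to the previous
    // chunk is wider than threshold * spaceWidth and no space is already there.
    static String join(List<Chunk> chunks, float threshold) {
        StringBuilder sb = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : chunks) {
            if (prev != null) {
                float gap = c.startX - prev.endX;
                boolean alreadySpaced = prev.text.endsWith(" ") || c.text.startsWith(" ");
                if (!alreadySpaced && gap > prev.spaceWidth * threshold) {
                    sb.append(' ');
                }
            }
            sb.append(c.text);
            prev = c;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = java.util.Arrays.asList(
                new Chunk("Hel", 0f, 15f, 5f),
                new Chunk("lo", 15.2f, 25f, 5f),    // small kerning gap: same word
                new Chunk("world", 30f, 55f, 5f));  // gap of one space width: new word
        System.out.println(join(chunks, 0.5f));     // prints "Hello world"
    }
}
```

The threshold is a tuning knob: too low and kerning adjustments split words, too high and tightly set text runs together.
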
> Please let me know if you have any feedback or suggestions!
>
> Cheers,
>
> - K
> -------------------------------------------------------------------------
> This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
> Build the coolest Linux based applications with Moblin SDK & win great prizes
> Grand prize is a trip for two to an Open Source event anywhere in the world
> http://moblin-contest.org/redirect.php?banner_id=100&url=/
> _______________________________________________
> iText-questions mailing list
> iText-questions@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/itext-questions
>
> Buy the iText book: http://www.1t3xt.com/docs/book.php
>
