Hi Kevin,

Would this be suitable for PDF-to-XML conversion? I have hundreds of tagged PDFs (PDF/A-1a) and I would like to convert them into XML, using the PDF tags as XML elements and keeping the reading order of the PDF/A-1a files.
Right now, the only tool I know of that does this is the Professional version of Adobe Acrobat.

On Mon, Nov 3, 2008 at 9:17 PM, Kevin Day <[EMAIL PROTECTED]> wrote:
> Hi all-
>
> I've put together a first cut for text extraction from PDF content streams
> (attached - hopefully the list will allow zip files to be sent). Hopefully I
> can get some feedback and/or suggestions on next steps.
>
> This is still a bit rough, but I think it is a good start. Use the
> PdfContentReaderTool class as your main() class, with a single command line
> argument holding the path of the PDF file you wish to process. This will
> output information about each page's content, including the extracted text.
>
> Here are my comments:
>
> 1. Instead of parsing the cmap itself, I used an external open source
> library (FontBox-0.1.0-dev.jar) - this is easy enough to re-write, but it
> didn't seem like that rewrite was adding a lot of value, so I took the
> shortcut. If the direction I'm headed is appealing, I can re-write the cmap
> portion of the code (it's already mostly coded in the DocumentFont class -
> I just need to take the information and put it into a usable cmap, plus ensure
> that the information gets processed for any font that has a ToUnicode entry).
> You will need to have the FontBox jar on your classpath for things to
> compile and run.
>
> 2. I think that it would be better to have the ToUnicode dictionary
> processed inside the font classes themselves (probably in DocumentFont). I
> had to force cmaps into the font classes using a sub-class - not pretty, but
> functional.
>
> 3. I have not performed any sort of exhaustive test with different types of
> fonts, etc... I have only tested with PDF files containing the ToUnicode
> dictionary entry on their fonts. My guess is that for fonts that do *not*
> have a ToUnicode entry and that have a UNC -> glyph mapping that isn't ANSI,
> things will break down rather quickly. Addressing this would definitely
> require adjustments to the DocumentFont class.
>
> 4. The use of the Tj operator (lower case 'j') in a source PDF may result in
> incorrect inter-word space prediction. I know how to fix this, but didn't
> want to take the time without feedback.
>
> Right now the text extraction is pretty simple - it handles inter-word space
> prediction, which is one of the harder things to deal with in PDF content
> streams, but it doesn't do full spatial analysis, etc... to determine the
> actual layout of words relative to each other, or perform spatial filtering,
> etc... There is nothing to prevent this from being added, though - the
> PdfContentStreamProcessor#displayText() callback provides enough state to
> allow this to happen.
>
> Please let me know if you have any feedback or suggestions!
>
> Cheers,
>
> - K
> -------------------------------------------------------------------------
> This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
> Build the coolest Linux based applications with Moblin SDK & win great prizes
> Grand prize is a trip for two to an Open Source event anywhere in the world
> http://moblin-contest.org/redirect.php?banner_id=100&url=/
> _______________________________________________
> iText-questions mailing list
> iText-questions@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/itext-questions
>
> Buy the iText book: http://www.1t3xt.com/docs/book.php
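
For anyone following the thread, here is a rough sketch of what the inter-word space prediction mentioned in the quoted message could look like. It is not the code from the attachment. The names SpacePredictionSketch, TextChunk and joinLine are hypothetical, and the rule used (insert a space when the horizontal gap between two consecutive chunks exceeds half the width of a space character) is an assumed heuristic. It also assumes the chunks are already on one line and sorted left to right.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names come from iText or the attached code.
class SpacePredictionSketch {

    /** A run of text shown by one text-showing operator, with its horizontal extent. */
    static final class TextChunk {
        final String text;
        final float startX;      // x coordinate where the chunk begins
        final float endX;        // x coordinate where the chunk ends
        final float spaceWidth;  // width of a space in the chunk's font and size

        TextChunk(String text, float startX, float endX, float spaceWidth) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.spaceWidth = spaceWidth;
        }
    }

    /**
     * Joins chunks that sit on the same line, in left-to-right order.
     * Assumed threshold: a gap wider than half a space width starts a new word.
     */
    static String joinLine(List<TextChunk> chunks) {
        StringBuilder out = new StringBuilder();
        TextChunk previous = null;
        for (TextChunk chunk : chunks) {
            if (previous != null) {
                float gap = chunk.startX - previous.endX;
                // Do not add a second space if the content stream already has one.
                boolean alreadySpaced =
                        previous.text.endsWith(" ") || chunk.text.startsWith(" ");
                if (!alreadySpaced && gap > chunk.spaceWidth / 2f) {
                    out.append(' ');
                }
            }
            out.append(chunk.text);
            previous = chunk;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<TextChunk> line = Arrays.asList(
                new TextChunk("Hel", 0f, 15f, 3f),
                new TextChunk("lo", 15.2f, 25f, 3f),    // tiny gap: same word
                new TextChunk("world", 28f, 55f, 3f));  // gap of 3 units: new word
        System.out.println(joinLine(line));             // prints "Hello world"
    }
}
```

A real implementation would take the chunk positions and the space width from the text state available in the displayText() callback, and would also have to handle rotated text and multiple lines, which this sketch ignores.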