WMJ wrote:
> 
> Currently the parser is event-based. I would more love to have a DOM-like
> thing.
> 

hmmmm...  Well that was certainly part of the original design consideration. 
But when you are processing a stack-based operator stream, and you have the
potential for huge streams, an event-based handler makes the most sense from
an implementation perspective.  As others are sure to point out, creating a
DOM from the event model is actually not that hard to do.  Heck,
LocationTextExtractionStrategy effectively does this as it accumulates text
operations (it's a pretty flat DOM, but that could be extended).
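
To make the "DOM from events" point concrete, here is a minimal sketch of a
listener that accumulates text events into a flat list of nodes. Everything in
it is hypothetical: TextEventListener, TextNode and DomBuilder are made-up
names for illustration, not iText classes, and the (text, x, y) event shape is
an assumption rather than what the parser actually hands you.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: none of these types are iText classes.
class FlatDomSketch {

    /** Hypothetical callback, standing in for an event-based parser listener. */
    interface TextEventListener {
        void onText(String text, float x, float y);
    }

    /** One leaf node: a run of text plus the state captured when it was drawn. */
    static final class TextNode {
        final String text;
        final float x;
        final float y;

        TextNode(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Listener that accumulates events into a flat, one-level structure. */
    static final class DomBuilder implements TextEventListener {
        private final List<TextNode> nodes = new ArrayList<>();

        @Override
        public void onText(String text, float x, float y) {
            nodes.add(new TextNode(text, x, y));
        }

        /** Read-only view of the nodes, in the order the events arrived. */
        List<TextNode> nodes() {
            return Collections.unmodifiableList(nodes);
        }
    }

    public static void main(String[] args) {
        DomBuilder builder = new DomBuilder();
        // In a real parser these calls would come from the content stream processor.
        builder.onText("Hello", 72f, 700f);
        builder.onText("world", 110f, 700f);
        for (TextNode n : builder.nodes()) {
            System.out.println(n.text + " @ (" + n.x + ", " + n.y + ")");
        }
    }
}
```

The streaming side stays event-based; only the listener decides how much state
to keep per node, which is why the cost question is really "how much do you
want to capture".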

At the end of the day, what we have heard from users is that they want to
get text extracted from the page.  Not access to every single draw
operation...  But there certainly could be use cases that aren't being
considered.

I think that it's also important to recognize that the PDF format doesn't
lend itself to rich, multi-level data structures.  For example, you outline
the concept of sub-nodes in your sample code.  What exactly would those
sub-nodes contain?  If you are expecting to see a DOM that consists of
pages, paragraphs, sentences and words, I think you may be asking for
something that PDF doesn't support.


So, how do you envision using the information that is in the DOM structure
that you describe?  And how much state do you want to capture in every node?

I could absolutely see an enhancement to LocationTextExtractionStrategy that
would return a DOM of some sort (or at least a "rich" string, which would
effectively be a DOM instead of just a string). This has been the intent of
the *Strategy objects from day one.
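
As a rough illustration of what a "rich" string could look like, here is a
sketch of a result object that keeps each chunk with its position and still
renders as plain text. RichText and Chunk are hypothetical names, not iText
API, and the line-grouping rule (identical y means same line) is a deliberate
simplification, not how LocationTextExtractionStrategy is implemented.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: RichText and Chunk are made-up names, not iText API.
class RichTextSketch {

    /** A piece of text together with the position it was drawn at. */
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Keeps the positioned chunks, but can still be used as a plain string. */
    static final class RichText {
        private final List<Chunk> chunks = new ArrayList<>();

        void add(String text, float x, float y) {
            chunks.add(new Chunk(text, x, y));
        }

        /** Chunks in reading order: larger y (top of page) first, then left to right. */
        List<Chunk> ordered() {
            List<Chunk> sorted = new ArrayList<>(chunks);
            sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                    .thenComparingDouble(c -> c.x));
            return sorted;
        }

        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder();
            Chunk previous = null;
            for (Chunk c : ordered()) {
                if (previous != null) {
                    // Assumption: chunks with exactly the same y share a line.
                    sb.append(previous.y == c.y ? ' ' : '\n');
                }
                sb.append(c.text);
                previous = c;
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) {
        RichText text = new RichText();
        text.add("world", 110f, 700f);
        text.add("Hello", 72f, 700f);
        text.add("Second", 72f, 680f);
        System.out.println(text);
    }
}
```

Callers who only want text call toString(); callers who want positions walk
ordered() instead.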





--
View this message in context: 
http://itext-general.2136553.n4.nabble.com/Save-PDF-as-plain-text-tp4041246p4073263.html
Sent from the iText - General mailing list archive at Nabble.com.
