WMJ wrote:
> Currently the parser is event-based. I would more love to have a DOM-like thing.

Hmmmm... Well, that was certainly part of the original design consideration. But when you are processing a stack-based operator stream, and you have the potential for huge streams, an event-based handler makes the most sense from an implementation perspective.

As others are sure to point out, creating a DOM from the event model is actually not that hard to do. Heck, LocationTextExtractionStrategy effectively does this as it accumulates text operations (it's a pretty flat DOM, but that could be extended).

At the end of the day, what we have heard from users is that they want to get text extracted from the page, not access to every single draw operation. But there certainly could be use cases that aren't being considered.

I think that it's also important to recognize that the PDF format doesn't lend itself to rich, multi-level data structures. For example, you outline the concept of sub-nodes in your sample code. What exactly would those sub-nodes contain? If you are expecting to see a DOM that consists of pages, paragraphs, sentences and words, I think you may be asking for something that PDF doesn't support.

So, how do you envision using the information that is in the DOM structure that you describe? And how much state do you want to capture in every node?

I could absolutely see an enhancement to LocationTextExtractionStrategy that would return a DOM of some sort (or at least a "rich" string, which would effectively be a DOM, instead of just a string). This has been the intent of the *Strategy objects from day one.
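
To make the "DOM from events" point concrete, here is a rough, self-contained sketch of the idea. It deliberately does not use any iText classes: `TextEventListener`, `Chunk`, `Line` and `Builder` are hypothetical names invented for illustration, and the "same line if the baselines are within a tolerance" rule is just an assumption for this example, not a description of how LocationTextExtractionStrategy actually works.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: accumulate text events into a flat, two-level tree. */
final class FlatTextDom {

    /** Callback an event-based parser might invoke per text-showing operation (assumed shape). */
    interface TextEventListener {
        void onText(String text, float x, float y);
    }

    /** Leaf node: one text chunk and the only state we chose to keep, its position. */
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Inner node: chunks whose baselines are close enough to count as one line. */
    static final class Line {
        final float y;
        final List<Chunk> chunks = new ArrayList<>();

        Line(float y) {
            this.y = y;
        }

        /** Joins the chunk texts with single spaces, in their current order. */
        String text() {
            StringBuilder sb = new StringBuilder();
            for (Chunk c : chunks) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(c.text);
            }
            return sb.toString();
        }
    }

    /** Collects events as they arrive, then builds the tree on demand. */
    static final class Builder implements TextEventListener {
        private final List<Chunk> chunks = new ArrayList<>();

        @Override
        public void onText(String text, float x, float y) {
            chunks.add(new Chunk(text, x, y));
        }

        List<Line> build(float tolerance) {
            List<Chunk> sorted = new ArrayList<>(chunks);
            // PDF user space has y growing upward, so descending y reads top to bottom.
            sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y).reversed());

            List<Line> lines = new ArrayList<>();
            for (Chunk c : sorted) {
                Line last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
                // Start a new line when the baseline differs from the line's first chunk.
                if (last == null || Math.abs(last.y - c.y) > tolerance) {
                    last = new Line(c.y);
                    lines.add(last);
                }
                last.chunks.add(c);
            }
            // Within each line, order chunks left to right.
            for (Line line : lines) {
                line.chunks.sort(Comparator.comparingDouble((Chunk c) -> c.x));
            }
            return lines;
        }
    }

    public static void main(String[] args) {
        Builder builder = new Builder();
        // Events may arrive in any order; the drawing order need not match reading order.
        builder.onText("world", 120f, 700f);
        builder.onText("Hello", 72f, 700f);
        builder.onText("Second line", 72f, 680f);
        for (Line line : builder.build(2f)) {
            System.out.println(line.text());
        }
    }
}
```

Note how shallow the tree is: lines and chunks are about all that position data alone can justify. Anything deeper (paragraphs, sentences) would have to come from heuristics, which is exactly the question of what the sub-nodes would contain and how much state each node should carry.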
