Hello,
First, I agree that it is easy to convert the current event model to a DOM
model. I have already implemented a very basic one with one or two days'
work.
I have processed quite a few PDF files so far, and I think huge page command
trees are rare. Few PDF documents contain more than 100KB of content per
page, so a DOM model is quite affordable. Not to mention that there are
already quite a lot of PDF editors and processors out there, and they do have
their own internal structures for PDF objects to support content editing.
With the DOM model mentioned above, developers who want to extract and
analyze text can traverse the DOM tree and grab all PdfShowTextCommand objects.
By inspecting a PdfShowTextCommand object, they immediately know the font, size,
position, and color of that piece of text. A PDF renderer named MuPDF
appears to have a similar API for extracting text.
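To make the idea concrete, here is a minimal sketch of such a traversal. Every class in it (PdfCommand, PdfContainerCommand, TextCollector, and the fields I gave PdfShowTextCommand) is hypothetical and only for illustration; none of them are part of iText, and color is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base node of the page command tree (not an iText class).
abstract class PdfCommand {
    final List<PdfCommand> children = new ArrayList<>();

    PdfCommand add(PdfCommand child) {
        children.add(child);
        return this;
    }
}

// Hypothetical container node, e.g. a page or a text area.
class PdfContainerCommand extends PdfCommand {
}

// Hypothetical leaf node: the text plus the state in effect when it is shown.
class PdfShowTextCommand extends PdfCommand {
    final String text;
    final String fontName;
    final float fontSize;
    final float x;
    final float y;

    PdfShowTextCommand(String text, String fontName, float fontSize, float x, float y) {
        this.text = text;
        this.fontName = fontName;
        this.fontSize = fontSize;
        this.x = x;
        this.y = y;
    }
}

class TextCollector {
    // Depth-first walk that gathers every PdfShowTextCommand in document order.
    static List<PdfShowTextCommand> collect(PdfCommand root) {
        List<PdfShowTextCommand> found = new ArrayList<>();
        walk(root, found);
        return found;
    }

    private static void walk(PdfCommand node, List<PdfShowTextCommand> found) {
        if (node instanceof PdfShowTextCommand) {
            found.add((PdfShowTextCommand) node);
        }
        for (PdfCommand child : node.children) {
            walk(child, found);
        }
    }
}
```

A caller would build or load the tree for a page, call TextCollector.collect(page), and then read the font, size, and position directly from each returned object.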
And...
We all know that PDF commands are linear. However, according to
the PDF specification, there are de facto "multi-level" structures. For
example, text commands must be placed between a pair of BT and ET commands, and a
pair of q and Q commands encloses the graphics commands within a scope. In the DOM
model, we don't need to worry about "did I add an ET command after the
BT or not". A PdfTextAreaCommand denotes the BT and implies an ET after all its
sub-commands. Sub-commands of the PdfTextAreaCommand can be
PdfTextMatrixCommand, PdfShowTextCommand, PdfFontCommand, etc.
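As a rough sketch of how the implied ET could work, a container node can write its opening operator, then its children, then its closing operator. The classes below (PageCommand, RawCommand, ScopedCommand, PdfGraphicsScopeCommand) are names I made up for this example, and PdfTextAreaCommand is reduced to the bare minimum; this is an illustration under those assumptions, not iText code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node type; none of these classes exist in iText.
abstract class PageCommand {
    abstract void writeTo(StringBuilder out);
}

// A leaf that emits one raw operator line, e.g. "/F1 12 Tf" or "(Hello) Tj".
class RawCommand extends PageCommand {
    private final String operatorLine;

    RawCommand(String operatorLine) {
        this.operatorLine = operatorLine;
    }

    @Override
    void writeTo(StringBuilder out) {
        out.append(operatorLine).append('\n');
    }
}

// A container that brackets its children with an opening and a closing operator,
// so the closing operator can never be forgotten.
class ScopedCommand extends PageCommand {
    private final String open;
    private final String close;
    private final List<PageCommand> children = new ArrayList<>();

    ScopedCommand(String open, String close) {
        this.open = open;
        this.close = close;
    }

    ScopedCommand add(PageCommand child) {
        children.add(child);
        return this;
    }

    @Override
    void writeTo(StringBuilder out) {
        out.append(open).append('\n');
        for (PageCommand child : children) {
            child.writeTo(out);
        }
        out.append(close).append('\n');
    }
}

// BT ... ET text area.
class PdfTextAreaCommand extends ScopedCommand {
    PdfTextAreaCommand() {
        super("BT", "ET");
    }
}

// q ... Q graphics state scope.
class PdfGraphicsScopeCommand extends ScopedCommand {
    PdfGraphicsScopeCommand() {
        super("q", "Q");
    }
}
```

Nesting a PdfTextAreaCommand inside a PdfGraphicsScopeCommand and calling writeTo on the outer node produces the q, BT, ET, Q operators in the right order without the caller ever writing a closing operator by hand.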
We might need to listen to other people's opinions and requirements for PDF
content processing.
Just because a job is easy to do doesn't mean it is pointless. If integrating it
into iText can save other programmers days of work, doing this kind of low-tech
job may well be worthwhile.
I am currently experimenting with the PDF page command DOM model (I also need
support for font encoding, font subsetting, and several other things that iText
lacks). A good thing about the DOM model is that we don't have to create many
small classes to consume the PDF command events. A single class can do a
variety of jobs on the same content. I am trying to write an
application that filters out unwanted parts, or batch-modifies parts, of PDF
pages. The event model is not sufficient or efficient for this. I may try
to find out more and improve the design.
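For example, filtering could look roughly like the sketch below: one small class walks the tree and drops every node that matches a condition. ContentNode and ContentFilter are hypothetical names used only for this illustration, and operands are kept as plain strings to keep it short.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical tree node: an operator name, its operands as text, and child nodes.
class ContentNode {
    final String operator;
    final String operands;
    final List<ContentNode> children = new ArrayList<>();

    ContentNode(String operator, String operands) {
        this.operator = operator;
        this.operands = operands;
    }

    ContentNode add(ContentNode child) {
        children.add(child);
        return this;
    }
}

class ContentFilter {
    // Removes every descendant that matches the predicate, together with its subtree.
    // Returns the number of matching nodes that were removed.
    static int removeIf(ContentNode parent, Predicate<ContentNode> unwanted) {
        int removed = 0;
        Iterator<ContentNode> it = parent.children.iterator();
        while (it.hasNext()) {
            ContentNode child = it.next();
            if (unwanted.test(child)) {
                it.remove();
                removed++;
            } else {
                removed += removeIf(child, unwanted);
            }
        }
        return removed;
    }
}
```

The same walk can serve different jobs just by passing a different predicate, for instance one that matches Tj nodes whose operands contain a watermark string.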
WMJ.
>________________________________
>From: Kevin Day <[email protected]>
>
>hmmmm... Well that was certainly part of the original design consideration.
>But when you are processing a stack based operator stream, and you have the
>potential for huge streams, an event based handler makes the most sense from
>an implementation perspective. As others are sure to point out, creating a
>DOM from the event model is actually not that hard to do. Heck,
>LocationTextExtractionStrategy effectively does this as it accumulates text
>operations (it's a pretty flat DOM, but that could be extended).
>
>At the end of the day, what we have heard from users is that they want to
>get text extracted from the page. Not access to every single draw
>operation... But there certainly could be use cases that aren't being
>considered.
>
>I think that it's also important to recognize that the PDF format doesn't
>lend itself to rich, multi-level data structures. For example, you outline
>the concept of sub-nodes in your sample code. What exactly would those
>sub-nodes contain? If you are expecting to see a DOM that consists of
>pages, paragraphs, sentences and words, I think you may be asking for
>something that PDF doesn't support.
>
>
>So, how do you envision using the information that is in the DOM structure
>that you describe? And how much state do you want to capture in every node?
>
>I could absolutely see an enhancement to LocationTextStrategy that would
>return a DOM of some sort (or at least a "rich" string, which would
>effectively be a DOM, instead of just a string). This has been the intent of
>the *Strategy objects from day one.
>
>--
>View this message in context:
>http://itext-general.2136553.n4.nabble.com/Save-PDF-as-plain-text-tp4041246p4073263.html
>Sent from the iText - General mailing list archive at Nabble.com.
>
>_______________________________________________
>iText-questions mailing list
>[email protected]
>https://lists.sourceforge.net/lists/listinfo/itext-questions
>
>iText(R) is a registered trademark of 1T3XT BVBA.
>Many questions posted to this list can (and will) be answered with a reference
>to the iText book: http://www.itextpdf.com/book/
>Please check the keywords list before you ask for examples:
>http://itextpdf.com/themes/keywords.php
>