Re: [iText-questions] Save PDF as plain text

WMJ Thu, 17 Nov 2011 01:37:05 -0800

Hello,


First, I agree that it is easy to convert the current event model to a DOM 
model. I have already implemented a very basic model in one or two 
days' work.


So far I've processed quite a lot of PDF files, and I think huge page command 
trees are rare. Few PDF documents contain more than 100KB of page content per 
page, so a DOM model is quite affordable. Not to mention that there are 
already quite a lot of PDF editors and processors out there, and they do have 
their own internal structures for PDF objects to support content editing.


With the DOM model mentioned above, developers who want to extract and 
analyze text can traverse the DOM tree and grab all PdfShowTextCommand objects. 
By inspecting a PdfShowTextCommand object, they immediately know the font, size, 
position, and color of each piece of text. A PDF renderer named MuPDF 
appears to have a similar API for extracting text.
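
To make the traversal idea concrete, here is a minimal Java sketch. It is 
not iText API: PdfCommand, PdfPageCommands and TextCollector are hypothetical 
names, and the fields on PdfShowTextCommand are only my assumption of what 
such a node might carry.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base node of the page command DOM (not part of iText).
abstract class PdfCommand {
    final List<PdfCommand> children = new ArrayList<>();

    PdfCommand add(PdfCommand child) {
        children.add(child);
        return this;
    }
}

// Hypothetical root node holding all commands of one page.
class PdfPageCommands extends PdfCommand {}

// Hypothetical container for everything between BT and ET.
class PdfTextAreaCommand extends PdfCommand {}

// Hypothetical leaf node; the captured state is an assumption.
class PdfShowTextCommand extends PdfCommand {
    final String text;
    final String fontName;
    final float fontSize;
    final float x;
    final float y;

    PdfShowTextCommand(String text, String fontName, float fontSize, float x, float y) {
        this.text = text;
        this.fontName = fontName;
        this.fontSize = fontSize;
        this.x = x;
        this.y = y;
    }
}

class TextCollector {
    // Depth-first walk that gathers every show-text node in document order.
    static List<PdfShowTextCommand> collect(PdfCommand root) {
        List<PdfShowTextCommand> found = new ArrayList<>();
        walk(root, found);
        return found;
    }

    private static void walk(PdfCommand node, List<PdfShowTextCommand> found) {
        if (node instanceof PdfShowTextCommand) {
            found.add((PdfShowTextCommand) node);
        }
        for (PdfCommand child : node.children) {
            walk(child, found);
        }
    }
}
```

A caller would build or parse a PdfPageCommands tree, call 
TextCollector.collect(page), and read the font and position from each result.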




And...



We all know that PDF commands are linear. However, according to 
the PDF specification, there are de facto "multi-level" structures. For 
example, text commands must be placed between a pair of BT and ET commands, and 
a pair of q and Q commands encloses the graphics commands within a scope. In the 
DOM model, we don't need to worry about "whether I've added an ET command after 
the BT or not". A PdfTextAreaCommand denotes the BT and implies an ET after all 
its sub-commands. Sub-commands of the PdfTextAreaCommand can be 
PdfTextMatrixCommand, PdfShowTextCommand, PdfFontCommand, etc.
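
As a rough illustration of how a container node can imply its closing 
operator, here is a small Java sketch. All classes are hypothetical (this is 
not iText API), PdfLeafCommand and PdfScopeCommand are helper names I made up, 
and operands are kept as plain strings to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical DOM node: every node can write itself as content stream text.
interface PdfCommand {
    void writeTo(StringBuilder out);
}

// Hypothetical leaf that emits one operator line, e.g. "/F1 12 Tf".
class PdfLeafCommand implements PdfCommand {
    private final String line;

    PdfLeafCommand(String line) {
        this.line = line;
    }

    public void writeTo(StringBuilder out) {
        out.append(line).append('\n');
    }
}

// Hypothetical container that always emits a matching open/close pair,
// so the closing operator can never be forgotten.
class PdfScopeCommand implements PdfCommand {
    private final String open;
    private final String close;
    private final List<PdfCommand> children = new ArrayList<>();

    PdfScopeCommand(String open, String close) {
        this.open = open;
        this.close = close;
    }

    PdfScopeCommand add(PdfCommand child) {
        children.add(child);
        return this;
    }

    public void writeTo(StringBuilder out) {
        out.append(open).append('\n');
        for (PdfCommand child : children) {
            child.writeTo(out);
        }
        out.append(close).append('\n');
    }
}

// BT ... ET
class PdfTextAreaCommand extends PdfScopeCommand {
    PdfTextAreaCommand() {
        super("BT", "ET");
    }
}

// q ... Q (the class name is an assumption)
class PdfGraphicsStateCommand extends PdfScopeCommand {
    PdfGraphicsStateCommand() {
        super("q", "Q");
    }
}
```

Writing a PdfGraphicsStateCommand that contains a PdfTextAreaCommand emits 
q, BT, the sub-commands, ET, Q in that order, without the caller ever adding 
ET or Q by hand.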




We might need to listen to other people's opinions and requirements on PDF 
content processing.

Just because a job is easy to do doesn't mean it is pointless. If integrating it 
into iText can save other programmers days of work, doing this kind of low-tech 
job may be worthwhile indeed.

I am currently experimenting with the PDF page command DOM model (I need support 
for font encoding, font subsetting, and other aspects that iText 
lacks). A good thing about the DOM model is that we don't have to create many 
small classes to consume the PDF command events. A single class can do a 
variety of jobs on the same content. I am trying to write an 
application that filters out unwanted parts, or batch-modifies parts, of PDF 
pages. The event model is not sufficient or effective for this. I may try 
to find out more and improve the design.
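
The filtering case could look roughly like the Java sketch below. Node and 
TreeFilter are hypothetical names, not iText classes, and operators and 
operands are kept as plain strings purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical generic DOM node: an operator, its operands, and sub-commands.
class Node {
    final String operator;   // e.g. "BT", "Tj", "re"
    final String operands;   // raw operand text, e.g. "(Draft)"
    final List<Node> children = new ArrayList<>();

    Node(String operator, String operands) {
        this.operator = operator;
        this.operands = operands;
    }

    Node add(Node child) {
        children.add(child);
        return this;
    }
}

class TreeFilter {
    // Removes every descendant subtree whose root matches the predicate
    // and returns how many subtrees were removed.
    static int prune(Node root, Predicate<Node> unwanted) {
        int before = root.children.size();
        root.children.removeIf(unwanted);
        int removed = before - root.children.size();
        for (Node child : root.children) {
            removed += prune(child, unwanted);
        }
        return removed;
    }
}
```

With the whole page in memory, one pass can drop, say, every Tj node whose 
operands contain "Draft"; an event handler would have to buffer commands to 
decide the same thing.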

WMJ.




>________________________________
>From: Kevin Day <[email protected]>
>
>hmmmm...  Well that was certainly part of the original design consideration. 
>But when you are processing a stack based operator stream, and you have the
>potential for huge streams, an event based handler makes the most sense from
>an implementation perspective.  As others are sure to point out, creating a
>DOM from the event model is actually not that hard to do.  Heck,
>LocationTextExtractionStrategy effectively does this as it accumulates text
>operations (it's a pretty flat DOM, but that could be extended).
>
>At the end of the day, what we have heard from users is that they want to
>get text extracted from the page.  Not access to every single draw
>operation...  But there certainly could be use cases that aren't being
>considered.
>
>I think that it's also important to recognize that the PDF format doesn't
>lend itself to rich, multi-level data structures.  For example, you outline
>the concept of sub-nodes in your sample code.  What exactly would those
>sub-nodes contain?  If you are expecting to see a DOM that consists of
>pages, paragraphs, sentences and words, I think you may be asking for
>something that PDF doesn't support.
>
>
>So, how do you envision using the information that is in the DOM structure
>that you describe?  And how much state do you want to capture in every node?
>
>I could absolutely see an enhancement to LocationTextStrategy that would
>return a DOM of some sort (or at least a "rich" string, which would
>effectively be a DOM, instead of just a string). This has been the intent of
>the *Strategy objects from day one.
>
>
>
>
>
>--
>View this message in context: 
>http://itext-general.2136553.n4.nabble.com/Save-PDF-as-plain-text-tp4041246p4073263.html
>Sent from the iText - General mailing list archive at Nabble.com.
>
>_______________________________________________
>iText-questions mailing list
>[email protected]
>https://lists.sourceforge.net/lists/listinfo/itext-questions
>
>iText(R) is a registered trademark of 1T3XT BVBA.
>Many questions posted to this list can (and will) be answered with a reference 
>to the iText book: http://www.itextpdf.com/book/
>Please check the keywords list before you ask for examples: 
>http://itextpdf.com/themes/keywords.php
>
>
>
