Re: [iText-questions] Save PDF as plain text

WMJ Thu, 17 Nov 2011 17:42:39 -0800

Hello,


Thanks for pointing this out.


Before this, I once came across a PDF whose compressed page streams were about 
1 megabyte each. It did not take too long to parse them and convert them to 
the primitive command model.


If an ET does not come after a BT, the DOM model might assume that the 
remaining commands all belong to the text object. If enclosing commands 
interleave with each other, it is an error. The DOM model should not tolerate 
this, and the current event model can't cope with it either. We can't expect 
too much of the first design. Handling correct documents is at least the 
initial goal.
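
To make the two cases concrete, here is a minimal sketch in Python (not iText 
code). It assumes the content stream has already been tokenized into 
(operator, operands) pairs; Node, NestingError and build_tree are hypothetical 
names used only for illustration. A missing ET is closed implicitly at the end 
of the stream, and interleaved pairs are rejected.

```python
from dataclasses import dataclass, field

# Opening operators and their closing partners (only the two pairs
# discussed in this thread; a real model would need more).
PAIRS = {"BT": "ET", "q": "Q"}
CLOSERS = {close: open_ for open_, close in PAIRS.items()}


@dataclass
class Node:
    op: str                      # operator name, e.g. "BT", "Tj", "q"
    operands: tuple = ()
    children: list = field(default_factory=list)


class NestingError(ValueError):
    """Raised when a closer does not match the innermost open operator."""


def build_tree(commands):
    """Turn a flat list of (operator, operands) pairs into a tree.

    An opener with no closer before the end of the stream is closed
    implicitly. A closer that does not match the innermost opener
    (for example BT q ET Q) is reported as an error.
    """
    root = Node("root")
    stack = [root]
    for op, operands in commands:
        if op in PAIRS:
            node = Node(op, tuple(operands))
            stack[-1].children.append(node)
            stack.append(node)
        elif op in CLOSERS:
            if stack[-1].op != CLOSERS[op]:
                raise NestingError(f"{op} does not close {stack[-1].op}")
            stack.pop()
        else:
            stack[-1].children.append(Node(op, tuple(operands)))
    return root


# The ET is missing: everything after BT ends up inside the text object.
tree = build_tree([("BT", ()), ("Tf", ("F1", 12)), ("Tj", ("Hello",))])
```

A more forgiving variant could record a warning instead of raising, which may 
be closer to what viewers do with damaged files.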


Yes, where the styles go is a problem. It would be good to discuss this and 
find a way to handle them.


I think it is useful to have a PDF command model, just like the XML DOM. The 
XML DOM is not very efficient when dealing with huge XML documents, but it is 
still a very popular tool for ordinary-sized documents. The real problem is 
not whether we should have one or not, but how to design the command model to 
make it useful and easy to use. Afterwards, we can provide both models and let 
developers choose the one they prefer.
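
As a rough illustration of what "easy to use" could mean, here is a small 
Python sketch (not iText code) of walking such a command tree to collect text 
together with its font and size. The class names follow the ones used earlier 
in this thread, but their fields and the extract_text function are assumptions 
made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class PdfCommand:
    children: list = field(default_factory=list)


@dataclass
class PdfTextAreaCommand(PdfCommand):   # stands for BT ... ET
    pass


@dataclass
class PdfFontCommand(PdfCommand):       # stands for the Tf operator
    font: str = ""
    size: float = 0.0


@dataclass
class PdfShowTextCommand(PdfCommand):   # stands for the Tj operator
    text: str = ""


def extract_text(node, state=None):
    """Depth-first walk yielding (text, font, size) for each show-text node.

    Simplification: one mutable state is shared by the whole walk, so the
    save/restore semantics of q and Q are ignored here.
    """
    state = state if state is not None else {"font": None, "size": None}
    if isinstance(node, PdfFontCommand):
        state["font"], state["size"] = node.font, node.size
    elif isinstance(node, PdfShowTextCommand):
        yield node.text, state["font"], state["size"]
    for child in node.children:
        yield from extract_text(child, state)


page = PdfCommand(children=[
    PdfTextAreaCommand(children=[
        PdfFontCommand(font="F1", size=12),
        PdfShowTextCommand(text="Hello"),
        PdfFontCommand(font="F2", size=9),
        PdfShowTextCommand(text="world"),
    ]),
])
runs = list(extract_text(page))
```

The same walk could drop or rewrite nodes instead of yielding them, which is 
the kind of filtering and batch modification I mentioned before.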


WMJ.




>________________________________
>From: Leonard Rosenthol <[email protected]>
>Subject: Re: [iText-questions] Save PDF as plain text
>
>
>Find some vector-heavy documents such as those in prepress/publishing or CAD 
>drawings.  Those will give you the heaviest content streams for your DOM.  
>I've enclosed TWO PAGES from a REAL WORLD document to demonstrate my point.  
>This will give you something fairly normal to implement against. When you 
>think you're "done", I'll share my favorite REAL WORLD sample that blows out 
>every DOM implementation UNTIL they prepare for it :).
>
>
>But my point is NOT to dissuade you – you are correct.  A DOM model for PDF 
>page content is a good thing and very useful.  However, it's NOT trivial to 
>implement.  You should be prepared to throw away your first implementation and 
>rewrite it after beginning to run it against stuff in the real world.  Adobe 
>Acrobat/Reader allow for LOTS of crap, because there is a LOT of crap out 
>there.  If your implementation assumes perfection, it's going to fail when 
>faced with reality.  A perfect example is your comment below about nesting – 
>what happens when "the end never comes"??   The other big thing you need to 
>work out in your DOM model is where attributes/styling goes – separate 
>objects?  Attributes on the DOM nodes?  Other?   And then how you relate them 
>from stream->DOM.
>
>
>Oh – and then once you get it working on a single page, you'll need to think 
>about how to handle recursion!  (aka how do you walk from the main page into a 
>Form XObject?)
>
>
>Have fun!!
>
>
>Leonard
>
>From:  WMJ <[email protected]>
>Reply-To:  WMJ <[email protected]>, Post here 
><[email protected]>
>Date:  Thu, 17 Nov 2011 01:36:52 -0800
>To:  Post here <[email protected]>
>Subject:  Re: [iText-questions] Save PDF as plain text
>
>
>
>Hello,
>
>
>
>Firstly, I agree that it is easy to convert the current event model to a DOM 
>model. I have already implemented a very basic model with one or two days' 
>work.
>
>
>
>So far I have processed quite a few PDF files, and I think huge page command 
>trees are rare. Few PDF documents contain more than 100KB of page content per 
>page, so a DOM model is quite affordable. Not to mention that there are 
>already quite a lot of PDF editors and processors out there, and they do have 
>their own internal structures for PDF objects to support content editing.
>
>
>
>With the DOM model mentioned above, developers who want to extract and 
>analyze text can traverse the DOM tree and grab all PdfShowTextCommand 
>objects. By inspecting a PdfShowTextCommand object, they immediately know the 
>font, size, position, and color of that piece of text. A PDF renderer named 
>MuPDF appears to have a similar API for extracting text.
>
>And...
>
>Although we all know that PDF commands are linear, according to the PDF 
>specification there are de facto "multi-level" structures. For example, text 
>commands must be placed between a pair of BT and ET commands, and a pair of q 
>and Q commands encloses the graphics commands within a scope. In the DOM 
>model, we don't need to worry about "whether I've added an ET command after 
>the BT or not". A PdfTextAreaCommand denotes the BT and implies an ET after 
>all its sub-commands. Sub-commands of the PdfTextAreaCommand can be 
>PdfTextMatrixCommand, PdfShowTextCommand, PdfFontCommand, etc.
>
>
>
>
>
>We might need to listen to other people's opinions and requirements on PDF 
>content processing.
>
>That a job is easy to do doesn't mean it is pointless. If integrating it into 
>iText can save other programmers days of work, doing this kind of low-tech 
>job may be meaningful indeed.
>
>
>I am currently experimenting with the PDF page command DOM model (I need 
>support for font encoding, font subsetting, and other aspects that iText 
>lacks). A good thing about the DOM model is that we don't have to create many 
>small classes to consume the PDF command events. A single class may do a 
>variety of jobs against the same amount of content. I am trying to write an 
>application that filters out unwanted parts of PDF pages, or modifies parts 
>in batch. The event model is not sufficient or effective for this. I may try 
>to find out more and improve the design.
>
>WMJ.
>
>
>
>
>>________________________________
>>From: Kevin Day <[email protected]>
>>
>>hmmmm...  Well that was certainly part of the original design consideration. 
>>But when you are processing a stack based operator stream, and you have the
>>potential for huge streams, an event based handler makes the most sense from
>>an implementation perspective.  As others are sure to point out, creating a
>>DOM from the event model is actually not that hard to do.  Heck,
>>LocationTextExtractionStrategy effectively does this as it accumulates text
>>operations (it's a pretty flat DOM, but that could be extended).
>>
>>At the end of the day, what we have heard from users is that they want to
>>get text extracted from the page.  Not access to every single draw
>>operation...  But there certainly could be use cases that aren't being
>>considered.
>>
>>I think that it's also important to recognize that the PDF format doesn't
>>lend itself to rich, multi-level data structures.  For example, you outline
>>the concept of sub-nodes in your sample code.  What exactly would those
>>sub-nodes contain?  If you are expecting to see a DOM that consists of
>>pages, paragraphs, sentences and words, I think you may be asking for
>>something that PDF doesn't support.
>>
>>
>>So, how do you envision using the information that is in the DOM structure
>>that you describe?  And how much state do you want to capture in every node?
>>
>>I could absolutely see an enhancement to LocationTextExtractionStrategy that
>>would return a DOM of some sort (or at least a "rich" string, which would
>>effectively be a DOM, instead of just a string). This has been the intent of
>>the *Strategy objects from day one.
>>
>>
>>
>>
>>
>>--
>>View this message in context: 
>>http://itext-general.2136553.n4.nabble.com/Save-PDF-as-plain-text-tp4041246p4073263.html
>>Sent from the iText - General mailing list archive at Nabble.com.
>>
>>_______________________________________________
>>iText-questions mailing list
>>[email protected]
>>https://lists.sourceforge.net/lists/listinfo/itext-questions
>>
>>iText(R) is a registered trademark of 1T3XT BVBA.
>>Many questions posted to this list can (and will) be answered with a 
>>reference to the iText book: http://www.itextpdf.com/book/
>>Please check the keywords list before you ask for examples: 
>>http://itextpdf.com/themes/keywords.php
>>
>>
>>
>
>

