I forgot to say that I agree: PDF can have a lot of crap. :)
>________________________________
> From: Leonard Rosenthol <[email protected]>
>Subject: Re: [iText-questions] Save PDF as plain text
>
>
>Find some vector-heavy documents such as those in prepress/publishing or CAD
>drawings. Those will give you the heaviest content streams for your DOM.
>I've enclosed TWO PAGES from a REAL WORLD document to demonstrate my point.
>This will give you something fairly normal to implement against. When you
>think you're "done", I'll share my favorite REAL WORLD sample that blows out
>every DOM implementation UNTIL it's been prepared for :).
>
>
>But my point is NOT to dissuade you – you are correct. A DOM model for PDF
>page content is a good thing and very useful. However, it's NOT trivial to
>implement. You should be prepared to throw away your first implementation and
>rewrite it after beginning to run it against stuff in the real world. Adobe
>Acrobat/Reader allow for LOTS of crap, because there is a LOT of crap out
>there. If your implementation assumes perfection, it's going to fail when
>faced with reality. A perfect example is your comment below about nesting –
>what happens when "the end never comes"?? The other big thing you need to
>work out in your DOM model is where attributes/styling goes – separate
>objects? Attributes on the DOM nodes? Other? And then how you relate them
>from stream->DOM.
>
>
>Oh – and then once you get it working on a single page, you'll need to think
>about how to handle recursion! (aka how do you walk from the main page into a
>Form Xobject?)
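>
>(A rough, untested sketch of that recursion using iText 5's reader API; only
>the dictionary/stream calls below are real iText, and how you feed the bytes
>into your own DOM builder is entirely up to you:)
>
>import com.itextpdf.text.pdf.*;
>
>// Untested sketch: find the Form XObject content streams reachable from a
>// page's resource dictionary, so a DOM builder can recurse into them just
>// like it walks the page stream. Real code should also guard against cycles.
>void walkFormXObjects(PdfDictionary resources) throws java.io.IOException {
>    if (resources == null) return;
>    PdfDictionary xobjects = resources.getAsDict(PdfName.XOBJECT);
>    if (xobjects == null) return;
>    for (PdfName name : xobjects.getKeys()) {
>        PdfObject obj = PdfReader.getPdfObject(xobjects.get(name));
>        if (!(obj instanceof PRStream)) continue;
>        PRStream stream = (PRStream) obj;
>        if (!PdfName.FORM.equals(stream.getAsName(PdfName.SUBTYPE))) continue;
>        byte[] content = PdfReader.getStreamBytes(stream);  // the XObject's own stream
>        // ...feed 'content' to the same tokenizer/tree builder used for the page...
>        walkFormXObjects(stream.getAsDict(PdfName.RESOURCES));  // XObjects can nest
>    }
>}
>
>// entry point, e.g.:
>// walkFormXObjects(reader.getPageN(pageNum).getAsDict(PdfName.RESOURCES));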
>
>
>Have fun!!
>
>
>Leonard
>
>From: WMJ <[email protected]>
>Reply-To: WMJ <[email protected]>, Post here
><[email protected]>
>Date: Thu, 17 Nov 2011 01:36:52 -0800
>To: Post here <[email protected]>
>Subject: Re: [iText-questions] Save PDF as plain text
>
>
>
>Hello,
>
>
>
>Firstly, I agree that it is easy to convert the current event model to a DOM
>model. I've already implemented a very basic model with one or two days' work.
>
>
>
>So far I've processed quite a few PDF files, and I think huge page command
>trees are rare: few PDF documents contain more than 100KB of content per page,
>so a DOM model is quite affordable. Not to mention the fact that there are
>already quite a lot of PDF editors and processors out there, and they have
>their own internal structures for PDF objects to support content editing.
>
>
>
>With the DOM model mentioned above, developers who want to extract and
>analyze text can traverse the DOM tree and grab all PdfShowTextCommand
>objects. By inspecting a PdfShowTextCommand object, they immediately know the
>font, size, position and color of those text pieces. A PDF rendering library
>named MuPDF appears to have a similar API for extracting text.
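>
>(Just to make that concrete: the class names below are only my working draft,
>nothing that exists in iText, and the traversal I have in mind is roughly
>this:)
>
>// Working draft only: PdfCommand / PdfShowTextCommand are my own proposed
>// classes, not iText API.
>java.util.List<PdfShowTextCommand> collectText(PdfCommand node,
>        java.util.List<PdfShowTextCommand> out) {
>    if (node instanceof PdfShowTextCommand) {
>        out.add((PdfShowTextCommand) node);  // carries text, font, size, position, color
>    }
>    for (PdfCommand child : node.getChildren()) {
>        collectText(child, out);             // descend through BT/ET and q/Q scopes
>    }
>    return out;
>}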
>
>
>
>
>
>
>
>And...
>
>
>
>
>
>We all know that PDF commands are linear. However, according to the PDF
>specification, there are de facto "multi-level" structures. For example, text
>commands must be placed within a pair of BT and ET commands, and a pair of q
>and Q commands encloses graphics commands within a scope. In the DOM model, we
>don't need to worry about "whether I've added an ET command after the BT or
>not": a PdfTextAreaCommand denotes the BT and implies the ET after all its
>sub-commands. Sub-commands of the PdfTextAreaCommand can be
>PdfTextMatrixCommand, PdfShowTextCommand, PdfFontCommand, etc.
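>
>(Roughly what the tree builder would do. Again, this is only a sketch with my
>draft class names, and note that it also has to survive streams where the ET
>or Q never comes, by closing any open scopes when the stream runs out:)
>
>import java.util.*;
>
>// Sketch only; PdfCommandNode / PdfOperator are draft names, not iText classes.
>PdfCommandNode buildTree(List<PdfOperator> ops) {
>    PdfCommandNode root = new PdfCommandNode(null);
>    Deque<PdfCommandNode> open = new ArrayDeque<PdfCommandNode>();
>    open.push(root);
>    for (PdfOperator op : ops) {
>        String name = op.getName();
>        if ("BT".equals(name) || "q".equals(name)) {        // opens a scope
>            PdfCommandNode scope = new PdfCommandNode(op);
>            open.peek().add(scope);
>            open.push(scope);
>        } else if ("ET".equals(name) || "Q".equals(name)) { // closes a scope
>            if (open.size() > 1) open.pop();                // stray ET/Q: ignore, don't fail
>        } else {
>            open.peek().add(new PdfCommandNode(op));        // Tm, Tf, Tj, re, f, ...
>        }
>    }
>    while (open.size() > 1) open.pop();  // the end never came: close scopes implicitly
>    return root;
>}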
>
>
>
>
>
>We might need to listen to other people's opinions and requirements on PDF
>content processing.
>
>The fact that a job is easy to do doesn't mean it is nonsense. If integrating
>it into iText can save other programmers days of work, doing this kind of
>low-tech job may be meaningful indeed.
>
>
>I am currently experimenting with the PDF page command DOM model (I need
>support for font encoding, font subsetting, and more aspects that iText
>lacks). A good thing about the DOM model is that we don't have to create many
>small classes to consume the PDF command events; a single class can do a
>variety of jobs against the same content. I am trying to write an application
>that filters out unwanted parts of PDF pages, or batch-modifies parts of them,
>and the event model is not as convenient or effective for that. I may try to
>find out more and improve the design.
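>
>(Very roughly what I mean: prune the unwanted nodes, serialize the tree back
>to a content stream, and let iText write the result. Only setPageContent()
>and PdfStamper are real iText 5 API here; buildTree(), parseOperators(),
>removeIf() and toContentBytes() are just my draft helpers:)
>
>import com.itextpdf.text.pdf.PdfReader;
>import com.itextpdf.text.pdf.PdfStamper;
>import java.io.FileOutputStream;
>
>// Draft: prune each page's command tree and write the modified document out.
>public static void main(String[] args) throws Exception {
>    PdfReader reader = new PdfReader("in.pdf");
>    for (int p = 1; p <= reader.getNumberOfPages(); p++) {
>        PdfCommandNode tree = buildTree(parseOperators(reader, p));
>        tree.removeIf(unwantedNodeFilter);                // drop unwanted nodes
>        reader.setPageContent(p, tree.toContentBytes());  // put the edited stream back
>    }
>    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("out.pdf"));
>    stamper.close();
>    reader.close();
>}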
>
>WMJ.
>
>
>
>
>>________________________________
>>From: Kevin Day <[email protected]>
>>
>>hmmmm... Well that was certainly part of the original design consideration.
>>But when you are processing a stack based operator stream, and you have the
>>potential for huge streams, an event based handler makes the most sense from
>>an implementation perspective. As others are sure to point out, creating a
>>DOM from the event model is actually not that hard to do. Heck,
>>LocationTextExtractionStrategy effectively does this as it accumulates text
>>operations (it's a pretty flat DOM, but that could be extended).
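>>
>>(Off the top of my head, and untested: a RenderListener that just accumulates
>>every text-showing event is already most of the way to that flat DOM. The
>>FlatTextDom/TextNode names are only illustrative; the parser API is real
>>iText 5.)
>>
>>import com.itextpdf.text.pdf.PdfReader;
>>import com.itextpdf.text.pdf.parser.*;
>>import java.util.ArrayList;
>>import java.util.List;
>>
>>// Untested sketch: a flat "DOM" of text operations built from the event model.
>>public class FlatTextDom implements RenderListener {
>>    // One node per text-showing operator: decoded text plus baseline start point.
>>    public static class TextNode {
>>        public final String text;
>>        public final Vector start;
>>        TextNode(String text, Vector start) { this.text = text; this.start = start; }
>>    }
>>    public final List<TextNode> nodes = new ArrayList<TextNode>();
>>    public void beginTextBlock() { }
>>    public void endTextBlock() { }
>>    public void renderImage(ImageRenderInfo info) { }
>>    public void renderText(TextRenderInfo info) {
>>        // Snapshot what we need right away; the parser reuses its graphics state later.
>>        nodes.add(new TextNode(info.getText(), info.getBaseline().getStartPoint()));
>>    }
>>}
>>
>>// usage:
>>// PdfReader reader = new PdfReader("in.pdf");
>>// FlatTextDom dom = new PdfReaderContentParser(reader).processContent(1, new FlatTextDom());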
>>
>>At the end of the day, what we have heard from users is that they want to
>>get text extracted from the page. Not access to every single draw
>>operation... But there certainly could be use cases that aren't being
>>considered.
>>
>>I think that it's also important to recognize that the PDF format doesn't
>>lend itself to rich, multi-level data structures. For example, you outline
>>the concept of sub-nodes in your sample code. What exactly would those
>>sub-nodes contain? If you are expecting to see a DOM that consists of
>>pages, paragraphs, sentences and words, I think you may be asking for
>>something that PDF doesn't support.
>>
>>
>>So, how do you envision using the information that is in the DOM structure
>>that you describe? And how much state do you want to capture in every node?
>>
>>I could absolutely see an enhancement to LocationTextExtractionStrategy that
>>would return a DOM of some sort (or at least a "rich" string, which would
>>effectively be a DOM instead of just a string) - this has been the intent of
>>the *Strategy objects from day one.
>>
>>
>>
>>
>>