I forgot to say that I agree: PDF can have a lot of crap. :)
>________________________________
> From: Leonard Rosenthol <[email protected]>
>Subject: Re: [iText-questions] Save PDF as plain text
>
>
>Find some vector-heavy documents such as those in prepress/publishing or CAD
>drawings. Those will give you the heaviest content streams for your DOM.
>I've enclosed TWO PAGES from a REAL WORLD document to demonstrate my point.
>This will give you something fairly normal to implement against. When you
>think you're "done", I'll share my favorite REAL WORLD sample that blows out
>every DOM implementation UNTIL it's been prepared for :).
>
>
>But my point is NOT to dissuade you – you are correct. A DOM model for PDF
>page content is a good thing and very useful. However, it's NOT trivial to
>implement. You should be prepared to throw away your first implementation and
>rewrite it after beginning to run it against stuff in the real world. Adobe
>Acrobat/Reader allow for LOTS of crap, because there is a LOT of crap out
>there. If your implementation assumes perfection, it's going to fail when
>faced with reality. A perfect example is your comment below about nesting –
>what happens when "the end never comes"?? The other big thing you need to
>work out in your DOM model is where attributes/styling goes – separate
>objects? Attributes on the DOM nodes? Other? And then how you relate them
>from stream->DOM.
>
>
>Oh – and then once you get it working on a single page, you'll need to think
>about how to handle recursion! (aka how do you walk from the main page into a
>Form Xobject?)
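>
>(A rough, untested sketch of that recursion using iText 5's reader API; only
>the dictionary/stream calls below are real iText, and how you feed the bytes
>into your own DOM builder is entirely up to you:)
>
>import com.itextpdf.text.pdf.*;
>
>// Untested sketch: find the Form XObject content streams reachable from a
>// page's resource dictionary, so a DOM builder can recurse into them just
>// like it walks the page stream. Real code should also guard against cycles.
>void walkFormXObjects(PdfDictionary resources) throws java.io.IOException {
>    if (resources == null) return;
>    PdfDictionary xobjects = resources.getAsDict(PdfName.XOBJECT);
>    if (xobjects == null) return;
>    for (PdfName name : xobjects.getKeys()) {
>        PdfObject obj = PdfReader.getPdfObject(xobjects.get(name));
>        if (!(obj instanceof PRStream)) continue;
>        PRStream stream = (PRStream) obj;
>        if (!PdfName.FORM.equals(stream.getAsName(PdfName.SUBTYPE))) continue;
>        byte[] content = PdfReader.getStreamBytes(stream);  // the XObject's own stream
>        // ...feed 'content' to the same tokenizer/tree builder used for the page...
>        walkFormXObjects(stream.getAsDict(PdfName.RESOURCES));  // XObjects can nest
>    }
>}
>
>// entry point, e.g.:
>// walkFormXObjects(reader.getPageN(pageNum).getAsDict(PdfName.RESOURCES));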
>
>
>Have fun!!
>
>
>Leonard
>
>From: WMJ <[email protected]>
>Reply-To: WMJ <[email protected]>, Post here
><[email protected]>
>Date: Thu, 17 Nov 2011 01:36:52 -0800
>To: Post here <[email protected]>
>Subject: Re: [iText-questions] Save PDF as plain text
>
>
>
>Hello,
>
>
>
>Firstly, I agree that it is easy to convert the current event model to a DOM
>model. I've already implemented a very basic model with one or two days' work.
>
>
>
>So far I've processed quite a few PDF files, and I think huge page command
>trees are rare: few PDF documents contain more than 100KB of content per page,
>so a DOM model is quite affordable. Not to mention the fact that there are
>already quite a lot of PDF editors and processors out there, and they have
>their own internal structures for PDF objects to support content editing.
>
>
>
>With the DOM model mentioned above, developers who want to extract and
>analyze text can traverse the DOM tree and grab all PdfShowTextCommand
>objects. By inspecting a PdfShowTextCommand object, they immediately know the
>font, size, position and color of those text pieces. A PDF rendering library
>named MuPDF appears to have a similar API for extracting text.
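>
>(Just to make that concrete: the class names below are only my working draft,
>nothing that exists in iText, and the traversal I have in mind is roughly
>this:)
>
>// Working draft only: PdfCommand / PdfShowTextCommand are my own proposed
>// classes, not iText API.
>java.util.List<PdfShowTextCommand> collectText(PdfCommand node,
>        java.util.List<PdfShowTextCommand> out) {
>    if (node instanceof PdfShowTextCommand) {
>        out.add((PdfShowTextCommand) node);  // carries text, font, size, position, color
>    }
>    for (PdfCommand child : node.getChildren()) {
>        collectText(child, out);             // descend through BT/ET and q/Q scopes
>    }
>    return out;
>}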
>
>
>
>
>
>
>
>And...
>
>
>
>
>
>We all know that PDF commands are linear. However, according to the PDF
>specification, there are de facto "multi-level" structures. For example, text
>commands must be placed within a pair of BT and ET commands, and a pair of q
>and Q commands encloses graphics commands within a scope. In the DOM model, we
>don't need to worry about "whether I've added an ET command after the BT or
>not": a PdfTextAreaCommand denotes the BT and implies the ET after all its
>sub-commands. Sub-commands of the PdfTextAreaCommand can be
>PdfTextMatrixCommand, PdfShowTextCommand, PdfFontCommand, etc.
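>
>(Roughly what the tree builder would do. Again, this is only a sketch with my
>draft class names, and note that it also has to survive streams where the ET
>or Q never comes, by closing any open scopes when the stream runs out:)
>
>import java.util.*;
>
>// Sketch only; PdfCommandNode / PdfOperator are draft names, not iText classes.
>PdfCommandNode buildTree(List<PdfOperator> ops) {
>    PdfCommandNode root = new PdfCommandNode(null);
>    Deque<PdfCommandNode> open = new ArrayDeque<PdfCommandNode>();
>    open.push(root);
>    for (PdfOperator op : ops) {
>        String name = op.getName();
>        if ("BT".equals(name) || "q".equals(name)) {        // opens a scope
>            PdfCommandNode scope = new PdfCommandNode(op);
>            open.peek().add(scope);
>            open.push(scope);
>        } else if ("ET".equals(name) || "Q".equals(name)) { // closes a scope
>            if (open.size() > 1) open.pop();                // stray ET/Q: ignore, don't fail
>        } else {
>            open.peek().add(new PdfCommandNode(op));        // Tm, Tf, Tj, re, f, ...
>        }
>    }
>    while (open.size() > 1) open.pop();  // the end never came: close scopes implicitly
>    return root;
>}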
>
>
>
>
>
>We might need to listen to other people's opinions and requirements on PDF
>content processing.
>
>The fact that a job is easy to do doesn't mean it is nonsense. If integrating
>it into iText can save other programmers days of work, doing this kind of
>low-tech job may be meaningful indeed.
>
>
>I am currently experimenting with the PDF page command DOM model (I need
>support for font encoding, font subsetting, and more aspects that iText
>lacks). A good thing about the DOM model is that we don't have to create many
>small classes to consume the PDF command events; a single class can do a
>variety of jobs against the same content. I am trying to write an application
>that filters out unwanted parts of PDF pages, or batch-modifies parts of them,
>and the event model is not as convenient or effective for that. I may try to
>find out more and improve the design.
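>
>(Very roughly what I mean: prune the unwanted nodes, serialize the tree back
>to a content stream, and let iText write the result. Only setPageContent()
>and PdfStamper are real iText 5 API here; buildTree(), parseOperators(),
>removeIf() and toContentBytes() are just my draft helpers:)
>
>import com.itextpdf.text.pdf.PdfReader;
>import com.itextpdf.text.pdf.PdfStamper;
>import java.io.FileOutputStream;
>
>// Draft: prune each page's command tree and write the modified document out.
>public static void main(String[] args) throws Exception {
>    PdfReader reader = new PdfReader("in.pdf");
>    for (int p = 1; p <= reader.getNumberOfPages(); p++) {
>        PdfCommandNode tree = buildTree(parseOperators(reader, p));
>        tree.removeIf(unwantedNodeFilter);                // drop unwanted nodes
>        reader.setPageContent(p, tree.toContentBytes());  // put the edited stream back
>    }
>    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("out.pdf"));
>    stamper.close();
>    reader.close();
>}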
>
>WMJ.
>
>
>
>
>>________________________________
>>From: Kevin Day <[email protected]>
>>
>>hmmmm... Well that was certainly part of the original design consideration.
>>But when you are processing a stack based operator stream, and you have the
>>potential for huge streams, an event based handler makes the most sense from
>>an implementation perspective. As others are sure to point out, creating a
>>DOM from the event model is actually not that hard to do. Heck,
>>LocationTextExtractionStrategy effectively does this as it accumulates text
>>operations (it's a pretty flat DOM, but that could be extended).
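>>
>>(Off the top of my head, and untested: a RenderListener that just accumulates
>>every text-showing event is already most of the way to that flat DOM. The
>>FlatTextDom/TextNode names are only illustrative; the parser API is real
>>iText 5.)
>>
>>import com.itextpdf.text.pdf.PdfReader;
>>import com.itextpdf.text.pdf.parser.*;
>>import java.util.ArrayList;
>>import java.util.List;
>>
>>// Untested sketch: a flat "DOM" of text operations built from the event model.
>>public class FlatTextDom implements RenderListener {
>>    // One node per text-showing operator: decoded text plus baseline start point.
>>    public static class TextNode {
>>        public final String text;
>>        public final Vector start;
>>        TextNode(String text, Vector start) { this.text = text; this.start = start; }
>>    }
>>    public final List<TextNode> nodes = new ArrayList<TextNode>();
>>    public void beginTextBlock() { }
>>    public void endTextBlock() { }
>>    public void renderImage(ImageRenderInfo info) { }
>>    public void renderText(TextRenderInfo info) {
>>        // Snapshot what we need right away; the parser reuses its graphics state later.
>>        nodes.add(new TextNode(info.getText(), info.getBaseline().getStartPoint()));
>>    }
>>}
>>
>>// usage:
>>// PdfReader reader = new PdfReader("in.pdf");
>>// FlatTextDom dom = new PdfReaderContentParser(reader).processContent(1, new FlatTextDom());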
>>
>>At the end of the day, what we have heard from users is that they want to
>>get text extracted from the page. Not access to every single draw
>>operation... But there certainly could be use cases that aren't being
>>considered.
>>
>>I think that it's also important to recognize that the PDF format doesn't
>>lend itself to rich, multi-level data structures. For example, you outline
>>the concept of sub-nodes in your sample code. What exactly would those
>>sub-nodes contain? If you are expecting to see a DOM that consists of
>>pages, paragraphs, sentences and words, I think you may be asking for
>>something that PDF doesn't support.
>>
>>
>>So, how do you envision using the information that is in the DOM structure
>>that you describe? And how much state do you want to capture in every node?
>>
>>I could absolutely see an enhancement to LocationTextExtractionStrategy that
>>would return a DOM of some sort (or at least a "rich" string, which would
>>effectively be a DOM instead of just a string) - this has been the intent of
>>the *Strategy objects from day one.
>>
>>
>>
>>
>>