Re: [iText-questions] Save PDF as plain text

mkl Tue, 15 Nov 2011 08:21:37 -0800

WMJ,

WMJ wrote:
> Currently the parser is event-based. I would more love to have a DOM-like
> thing.

IMO it is a good design choice for the current API to work in an event-based manner.

On the one hand, this requires the fewest resources: if the parser always first
transformed the page content into objects as you propose, it would consume a lot
of memory even if a user was merely looking for some minute detail, which in the
event-based architecture hardly requires any extra memory.

On the other hand, as Leonard already pointed out, you can easily create a
list of your PdfCommand objects in a customized event listener and, thus,
allow everyone to be happy, even your "ordinary developers" ;)
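
A minimal sketch of such a collecting listener follows. It does not use iText's
actual parser classes: ContentEventListener, ShowTextCommand and
CollectingListener are hypothetical names invented for illustration, and the
demo driver merely simulates a parser firing two events.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for an event-based parser callback (not iText's interface).
interface ContentEventListener {
    void onShowText(String text, String font, float size, float x, float y);
}

// Hypothetical command object, in the spirit of the proposed PdfShowTextCommand.
final class ShowTextCommand {
    final String text;
    final String font;
    final float size;
    final float x;
    final float y;

    ShowTextCommand(String text, String font, float size, float x, float y) {
        this.text = text;
        this.font = font;
        this.size = size;
        this.x = x;
        this.y = y;
    }
}

// Listener that turns every event into a command object and keeps it in a list.
final class CollectingListener implements ContentEventListener {
    private final List<ShowTextCommand> commands = new ArrayList<>();

    @Override
    public void onShowText(String text, String font, float size, float x, float y) {
        commands.add(new ShowTextCommand(text, font, size, x, y));
    }

    // Read-only view of the commands collected so far, in event order.
    List<ShowTextCommand> commands() {
        return Collections.unmodifiableList(commands);
    }
}

final class CollectingListenerDemo {
    // Simulates a parser reporting two text-showing events to the listener.
    static List<ShowTextCommand> run() {
        CollectingListener listener = new CollectingListener();
        listener.onShowText("Hello", "Helvetica", 12f, 72f, 720f);
        listener.onShowText("World", "Helvetica", 12f, 110f, 720f);
        return listener.commands();
    }
}
```

Only callers who want the list pay for its memory; anyone else can plug in a
listener that keeps nothing but the one detail it is looking for.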

> DOM-like model is usually easier to handle for ordinary developers rather
> than the subscription event model.

As you already started designing an appropriate class family, you might want
to finish that task and contribute it...

WMJ wrote:
> With that model, the internal structure of the PDF content streams are
> easier to understand and developers won't have to create their own content
> event consuming classes to find out what font, what size or what location
> is for a specific text. They just check through the command tree, find a
> PdfShowTextCommand with the text they are interested in, and access the
> font, size, location from the PdfShowTextCommand's properties. OK, their
> jobs are done.

That sounds very easy. Unfortunately, real-life documents from the wild can
break such an attempt. Why do you think that the text those easygoing
programmers search for is displayed by a single command? To make room for some
extra space, it might be split across multiple commands. Furthermore, those
commands need not immediately follow each other.

And even if you took the time to sort and combine the commands, you might still
be in trouble whenever replacement fonts come into play.
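
To illustrate what "sort and combine" involves, here is a rough sketch. It
assumes each text-showing command has already been reduced to a chunk with a
start position and a width; TextChunk, ChunkMerger and the spaceGap threshold
are hypothetical names, not part of any existing API. It only addresses
ordering and splitting, not the font problem.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical text fragment produced by one text-showing command.
final class TextChunk {
    final String text;
    final float x;      // start of the fragment on the baseline
    final float y;      // baseline position (larger means higher on the page)
    final float width;  // rendered width of the fragment

    TextChunk(String text, float x, float y, float width) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
    }
}

final class ChunkMerger {
    // Sorts chunks into reading order and joins them into a string.
    // A horizontal gap wider than spaceGap is treated as a word space.
    static String merge(List<TextChunk> chunks, float spaceGap) {
        List<TextChunk> sorted = new ArrayList<>(chunks);
        // Top line first, then left to right within a line.
        sorted.sort((a, b) -> a.y != b.y
                ? Float.compare(b.y, a.y)
                : Float.compare(a.x, b.x));

        StringBuilder out = new StringBuilder();
        TextChunk prev = null;
        for (TextChunk c : sorted) {
            if (prev != null) {
                if (c.y != prev.y) {
                    // Different baseline: start a new line. Real documents
                    // would need a tolerance here instead of exact equality.
                    out.append('\n');
                } else if (c.x - (prev.x + prev.width) > spaceGap) {
                    out.append(' ');
                }
            }
            out.append(c.text);
            prev = c;
        }
        return out.toString();
    }
}
```

With this in place, a search for "Hello" succeeds even if the document draws
"Hel" and "lo" in separate commands that appear far apart in the content stream.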

As mentioned above, your proposed command object list sounds like an
interesting feature for you to contribute, but the current API should remain
the first stage in parsing due to performance and flexibility
considerations.

Regards,   Michael

--
View this message in context: 
http://itext-general.2136553.n4.nabble.com/Save-PDF-as-plain-text-tp4041246p4073142.html
Sent from the iText - General mailing list archive at Nabble.com.
