RE: [iText-questions] Re: Preliminary work on PDF content extraction

Paulo Soares Tue, 18 Nov 2003 03:59:28 -0800

In case you haven't noticed, PRTokenizer can be used to parse content streams
(except inline images) and get the object types. Of course, parsing the
content is the least of the worries.
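
To make "get the object types" concrete, here is a minimal, self-contained
sketch of typed tokenizing. It does not use iText and is not the PRTokenizer
API: the class ContentTokenSketch, its Token and Type names, and the
tokenize method are all hypothetical, made up for illustration. It only
recognizes numbers, names, literal strings and operators; arrays,
dictionaries, hex strings and comments are rejected, and inline images
(raw binary data between ID and EI) would break it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not iText code and not the PRTokenizer API.
final class ContentTokenSketch {

    enum Type { NUMBER, NAME, STRING, OPERATOR }

    static final class Token {
        final Type type;
        final String text;

        Token(Type type, String text) {
            this.type = type;
            this.text = text;
        }

        @Override
        public String toString() {
            return type + ":" + text;
        }
    }

    // PDF white-space characters (NUL, tab, LF, FF, CR, space).
    static boolean isWhite(char c) {
        return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
    }

    // PDF delimiter characters; they end a number, name or operator.
    static boolean isDelimiter(char c) {
        return "()<>[]{}/%".indexOf(c) >= 0;
    }

    static List<Token> tokenize(String content) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < content.length()) {
            char c = content.charAt(i);
            if (isWhite(c)) {
                i++;
            } else if (c == '(') {
                i = readString(content, i, tokens);
            } else if (c == '/') {
                int end = endOfRegularRun(content, i + 1);
                tokens.add(new Token(Type.NAME, content.substring(i + 1, end)));
                i = end;
            } else if (isDelimiter(c)) {
                // Arrays, dictionaries, hex strings and comments are left out on purpose.
                throw new IllegalArgumentException("unsupported character '" + c + "' at " + i);
            } else {
                int end = endOfRegularRun(content, i);
                String word = content.substring(i, end);
                boolean numeric = word.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)");
                tokens.add(new Token(numeric ? Type.NUMBER : Type.OPERATOR, word));
                i = end;
            }
        }
        return tokens;
    }

    // Returns the index of the first white-space or delimiter character at or after 'from'.
    static int endOfRegularRun(String s, int from) {
        int j = from;
        while (j < s.length() && !isWhite(s.charAt(j)) && !isDelimiter(s.charAt(j))) {
            j++;
        }
        return j;
    }

    // Reads a literal string starting at '(' and returns the index after its closing ')'.
    static int readString(String s, int start, List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        int depth = 1;
        int j = start + 1;
        while (j < s.length()) {
            char d = s.charAt(j);
            if (d == '\\' && j + 1 < s.length()) {
                sb.append(s.charAt(j + 1)); // keeps the escaped char; no \n or octal decoding
                j += 2;
                continue;
            }
            if (d == '(') {
                depth++;
            } else if (d == ')' && --depth == 0) {
                j++;
                break;
            }
            sb.append(d);
            j++;
        }
        tokens.add(new Token(Type.STRING, sb.toString()));
        return j;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("BT /F1 12 Tf 72 712 Td (Hello \\(PDF\\)) Tj ET")) {
            System.out.println(t);
        }
    }
}
```

Note that the tokens still carry their text as a String; the type tag is
what lets a higher layer decide whether and when to convert a number.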


Best Regards,
Paulo Soares

> -----Original Message-----
> From: Leonard Rosenthol [SMTP:[EMAIL PROTECTED]
> Sent: Monday, November 17, 2003 23:24
> To:   Kevin Day; [EMAIL PROTECTED]
> Subject:      [iText-questions] Re: Preliminary work on PDF content
> extraction
> 
> At 9:59 AM -0700 11/17/03, Kevin Day wrote:
> >Splitting the content tokens into String objects was a very intentional
> >decision, aimed at parsing performance.
> 
>       Actually, performance isn't hindered by doing real 
> object/stack-based parsing.  I've written at least 3 PDF content 
> parsers, and the difference is minimal.
> 
> 
> >When iterating through the content stream, most situations do not need to
> >process all of the operators (or all of the operands for that matter).
> 
>       As I said, this is a VERY naive approach to PDF parsing. 
> There is SO MUCH that can be (and is) present in PDF content that 
> your parser won't handle - starting with complex colorspaces, going 
> to marked content, and esp. things like inline images!
> 
> 
> >If I construct an object for every single operation in the low level
> >parser, I incur the full cost of string to number conversion, as well as
> >the creation and immediate destruction of tons of little objects.
> 
>       True, but that shouldn't be a huge deal.  Look at all of the 
> other parsers - PdfBox, JPEDAL and Multivalent. They all do it this 
> way.  They are all relatively fast (for Java ;).
> 
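
For readers following the thread, the object/stack-based style being argued
for can be sketched roughly as below. This is an illustration under stated
assumptions, not code from PdfBox, JPEDAL, Multivalent or iText: the class
OperandStackSketch and its Operation type are hypothetical, and the input is
assumed to contain only whitespace-separated numbers, names and operators
(no strings, arrays or inline images). Operands are converted to objects
eagerly and collected until an operator arrives.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of operand-stack parsing; not taken from any real parser.
final class OperandStackSketch {

    static final class Operation {
        final String operator;
        final List<Object> operands; // Double for numbers, String for names

        Operation(String operator, List<Object> operands) {
            this.operator = operator;
            this.operands = operands;
        }
    }

    static List<Operation> parse(String content) {
        List<Operation> operations = new ArrayList<>();
        List<Object> stack = new ArrayList<>();
        for (String word : content.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // empty input
            }
            if (word.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)")) {
                stack.add(Double.valueOf(word)); // eager string-to-number conversion
            } else if (word.startsWith("/")) {
                stack.add(word.substring(1)); // name operand, leading slash dropped
            } else {
                // Anything else is treated as an operator that consumes the pending operands.
                operations.add(new Operation(word, new ArrayList<>(stack)));
                stack.clear();
            }
        }
        return operations;
    }

    public static void main(String[] args) {
        for (Operation op : parse("1 0 0 1 72 712 cm /F1 12 Tf")) {
            System.out.println(op.operator + " " + op.operands);
        }
    }
}
```

The Double.valueOf call and the per-operation list copy are exactly the
conversion and allocation costs the quoted text is weighing.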
> 
> >From a scaling perspective, I think it is better to parse these types of
> >data streams in stages, allowing higher level interpreters to decide
> >whether they want to incur the cost of converting a particular
> >transformation.
> 
>       Again, it's a good theory, but it doesn't work in practice.
> 
> 
> >As an example, here is the approach that I am thinking about taking in
> >the next interpreter layer:
> 
>       You're too far ahead of yourself.   You REALLY need to 
> rearchitect your current parser and move away from "string 
> parameters" if you are serious about parsing PDF files.
> 
>       If you'd like some sample documents that your parser will 
> just fall over and die on because of your design decisions, I'll be 
> glad to send you some.
> 
> 
> Leonard
> -- 
> ---------------------------------------------------------------------------
> Leonard Rosenthol
> <mailto:[EMAIL PROTECTED]>
> Chief Technical Officer                      <http://www.pdfsages.com>
> PDF Sages, Inc.                              215-629-3700 (voice)
>                                               215-629-0789 (fax)
> 
> 
> -------------------------------------------------------
> This SF. Net email is sponsored by: GoToMyPC
> GoToMyPC is the fast, easy and secure way to access your computer from
> any Web browser or wireless device. Click here to Try it Free!
> https://www.gotomypc.com/tr/OSDN/AW/Q4_2003/t/g22lp?Target=mm/g22lp.tmpl
> _______________________________________________
> iText-questions mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/itext-questions


