Re: [iText-questions] Save PDF as plain text

WMJ Tue, 15 Nov 2011 07:34:07 -0800

Hello Kevin,

Currently the parser is event-based. I would much rather have a DOM-like model.

For example,

q
BT
1 0 0 1 12 12 Tm
/F1 12 Tf %F1 is a font named Times New Roman
(abcdef) Tj
ET
Q

The commands above could be parsed into structured content like this:

<gs>
  <text>
    <matrix value="..."/>
    <font name="Times New Roman" resName="F1" size="12" />
    <print>abcdef</print>
  </text>
</gs>

The actual implementation of the parsed objects need not be XML. They could 
be classes. For example, an abstract class could hold the structure common to 
all PDF commands:

public abstract class PdfCommand {
  // gs, text, textMatrix, etc.
  public PdfCommandType Type { get; protected set; }
  // The list itself cannot be replaced from outside, but its contents can be modified.
  public List<Parameters> Parameters { get; protected set; }
  // Maybe this should be placed within a subclass such as PdfEnclosingCommand.
  public List<PdfCommand> SubCommands { get; protected set; }
}

Concrete classes may expose more information about a command. For example, the 
Tj command could be a dedicated class that inherits from PdfCommand:
public class PdfShowTextCommand : PdfCommand {
  // Extended property: a class that holds the Font, Text, TextMatrix, and
  // other information about the text shown by this command. That information
  // has to be gathered before the parser reaches the Tj command.
  public TextGraphicState TextGraphicState { get; protected set; }
}

A DOM-like model is usually easier for ordinary developers to work with than 
an event-subscription model.
With such a model, the internal structure of a PDF content stream is easier 
to understand, and developers do not have to write their own event-consuming 
classes to find out the font, size, or location of a specific piece of text. 
They just walk the command tree, find the PdfShowTextCommand with the text 
they are interested in, and read the font, size, and location from its 
properties. Then their job is done.
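
A rough sketch of that lookup, written in Java because iText itself is a Java 
library. Everything here is hypothetical and not part of iText: CommandNode, 
parse, and findText are names made up for illustration, and the tokenizer is 
deliberately naive (whitespace-separated tokens only, no comments, no strings 
containing spaces, and only the seven operators from the example).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical node type: one PDF operator plus its operands and nested commands.
class CommandNode {
    final String operator;                       // e.g. "q", "BT", "Tf", "Tj"
    final List<String> operands;                 // raw operand tokens, in stream order
    final List<CommandNode> children = new ArrayList<>();
    String fontResource;                         // text state, filled in for Tj only
    double fontSize;

    CommandNode(String operator, List<String> operands) {
        this.operator = operator;
        this.operands = operands;
    }
}

class CommandTreeDemo {
    // Builds a command tree from a whitespace-separated content stream.
    static CommandNode parse(String stream) {
        CommandNode root = new CommandNode("root", new ArrayList<>());
        Deque<CommandNode> open = new ArrayDeque<>();
        open.push(root);
        List<String> pending = new ArrayList<>();
        String font = null;
        double size = 0;
        for (String token : stream.trim().split("\\s+")) {
            if (!isOperator(token)) {            // operands precede their operator
                pending.add(token);
                continue;
            }
            if (token.equals("Q") || token.equals("ET")) {
                open.pop();                      // close the enclosing q or BT node
                pending.clear();
                continue;
            }
            CommandNode node = new CommandNode(token, new ArrayList<>(pending));
            pending.clear();
            open.peek().children.add(node);
            if (token.equals("q") || token.equals("BT")) {
                open.push(node);                 // following commands nest inside
            } else if (token.equals("Tf")) {
                font = node.operands.get(0).substring(1);   // drop the leading '/'
                size = Double.parseDouble(node.operands.get(1));
            } else if (token.equals("Tj")) {
                node.fontResource = font;        // snapshot of the current text state
                node.fontSize = size;
            }
        }
        return root;
    }

    static boolean isOperator(String token) {
        return List.of("q", "Q", "BT", "ET", "Tm", "Tf", "Tj").contains(token);
    }

    // Depth-first search for the first Tj command that shows the given text.
    static CommandNode findText(CommandNode node, String text) {
        if (node.operator.equals("Tj") && node.operands.get(0).equals("(" + text + ")")) {
            return node;
        }
        for (CommandNode child : node.children) {
            CommandNode hit = findText(child, text);
            if (hit != null) return hit;
        }
        return null;
    }

    public static void main(String[] args) {
        CommandNode root = parse("q BT 1 0 0 1 12 12 Tm /F1 12 Tf (abcdef) Tj ET Q");
        CommandNode tj = findText(root, "abcdef");
        System.out.println(tj.fontResource + " " + tj.fontSize);
    }
}
```

The point of the design is that the Tj node carries a snapshot of the text 
state, so the caller never has to replay the Tf and Tm commands that came 
before it.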

It would also be possible to reconstruct the content stream from the DOM-like 
PdfCommand model. Of course, that would not be as easy as creating a document 
with iText or modifying content with PdfStamper.
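
To illustrate the reconstruction idea, here is a minimal Java sketch that 
writes a small command tree back out as content stream text. StreamNode and 
StreamWriterDemo are hypothetical names, not iText classes, and the sketch 
assumes operands are already stored as raw tokens, so it does no escaping or 
number formatting.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node type for this sketch; not an iText class.
class StreamNode {
    final String operator;                 // opening operator, e.g. "q", "BT", "Tj"
    final String closer;                   // "Q" for q, "ET" for BT, null for leaf commands
    final List<String> operands = new ArrayList<>();
    final List<StreamNode> children = new ArrayList<>();

    StreamNode(String operator, String closer, String... operands) {
        this.operator = operator;
        this.closer = closer;
        this.operands.addAll(List.of(operands));
    }

    StreamNode add(StreamNode child) {     // returns this so calls can be chained
        children.add(child);
        return this;
    }
}

class StreamWriterDemo {
    // Writes operands, then the operator, then children, then the closing operator.
    static void write(StreamNode node, StringBuilder out) {
        for (String operand : node.operands) {
            out.append(operand).append(' ');
        }
        out.append(node.operator).append('\n');
        for (StreamNode child : node.children) {
            write(child, out);
        }
        if (node.closer != null) {
            out.append(node.closer).append('\n');
        }
    }

    // Rebuilds the example stream from a hand-made tree.
    static String sample() {
        StreamNode text = new StreamNode("BT", "ET")
            .add(new StreamNode("Tm", null, "1", "0", "0", "1", "12", "12"))
            .add(new StreamNode("Tf", null, "/F1", "12"))
            .add(new StreamNode("Tj", null, "(abcdef)"));
        StreamNode gs = new StreamNode("q", "Q").add(text);
        StringBuilder out = new StringBuilder();
        write(gs, out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(sample());
    }
}
```

A real implementation would also have to handle inline images, marked 
content, and string escaping, which is where most of the difficulty lies.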

Another enhancement would be the ability to decode text when the font has no 
ToUnicode entry.

WMJ

>________________________________
>
>WMJ wrote:
>> 
>> The parser is not very powerful or convenient yet, but it does point you
>> to the most detailed part of the PDF text.
>> 
>
>WMJ - what would you see as enhancements that would make it more powerful or
>convenient?  Understanding the details of this will help us improve. 
>Thanks.
>
>--
>


