Oops! I forgot to respond to the second paragraph in your response.

In the second paragraph, it seems you are arguing too strongly for your own 
limitations.

The procedure I outlined involves only rudimentary parsing, matching, and 
substitution, followed by printing the text fragments to STDOUT in the order in 
which they are found. What could be simpler than parsing lines for instances of

     (some text here)Tj

or

     (some more text here)TJ

and on finding them, extracting the embedded text fragments, and printing them 
to STDOUT? My Perl script was only a few lines of code. I expect I could easily 
rewrite it in Java for incorporation in an iText tool. I know I could write the 
script as a shell script. In fact, I think it could be written as a one-liner. 
In summary, the procedure I outlined could be coded very simply in almost any 
programmer's favorite scripting language, provided the language has a built-in 
or otherwise available regex engine. 
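
As a rough illustration of the procedure described above (not the original Perl script, which is not shown in this thread), here is a minimal Python sketch. The pattern, the function name extract_fragments, and the sample lines are all made up for the example, and the pattern assumes each fragment sits on one line with no nested or escaped parentheses.

```python
import re

# Matches "(text)Tj" or "(text)TJ"; assumes no nested or escaped parentheses.
FRAGMENT = re.compile(r"\(([^()]*)\)\s*T[jJ]\b")

def extract_fragments(lines):
    """Return the text inside each (...)Tj / (...)TJ, in the order found."""
    fragments = []
    for line in lines:
        fragments.extend(FRAGMENT.findall(line))
    return fragments

# Made-up sample resembling a few lines of a PS print file.
sample = [
    "/F1 12 Tf",
    "     (some text here)Tj",
    "72 700 moveto",
    "     (some more text here)TJ",
]

for fragment in extract_fragments(sample):
    print(fragment)
```

To try this on a real print file, the list could be replaced by an open file object, since the function only iterates over lines. The fragments come out in file order, which may not be reading order.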

The "pattern" can be as robust as you care to make it, i.e., characters only, 
characters + punctuation marks, etc.
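
To make the point about pattern robustness concrete, here is a hedged Python sketch comparing two hypothetical patterns: one that accepts only letters and spaces, and one that also accepts punctuation and backslash-escaped parentheses. The names WORDS_ONLY, WITH_ESCAPES, and unescape are illustrative assumptions, not part of any existing tool.

```python
import re

# Strict: only letters and spaces between the parentheses.
WORDS_ONLY = re.compile(r"\(([A-Za-z ]+)\)\s*T[jJ]\b")

# Looser: any character, including backslash escapes such as \( and \).
WITH_ESCAPES = re.compile(r"\(((?:[^\\()]|\\.)*)\)\s*T[jJ]\b")

def unescape(text):
    """Turn \\( \\) and \\\\ back into the literal characters."""
    return re.sub(r"\\([()\\])", r"\1", text)

plain = "(Hello world)Tj"
tricky = r"(f\(x\) = 3.5, see p. 2)Tj"

print(WORDS_ONLY.findall(plain))    # the strict pattern is enough here
print(WORDS_ONLY.findall(tricky))   # the strict pattern misses this line
print([unescape(t) for t in WITH_ESCAPES.findall(tricky)])
```

The stricter pattern silently drops fragments it does not understand, so the choice is a trade-off between noise in the output and missing text.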

Bruno and I both cautioned readers carefully about the possibility of text 
fragments not appearing in the correct order, i.e., the order in which they can 
be seen in the displayed PDF, together with reasons why this is so. The reasons 
are well known, so no further discussion on "order" is necessary or desired.

The procedure I outlined was developed and tested about 5 years ago. In one 
test, I extracted the text from a part of the Ghostscript User Manual: the PS 
(.prn, in my case) file was 353KB (7268L, 361091C), and the resulting plain 
text file was 1.75KB (30L, 1799C). Clearly, the procedure I outlined has 
potential for saving a lot of time for someone who needs to extract text 
content from a PDF, whether the text is extracted directly from the PDF file, 
when feasible, or from a PS print file, when extraction from the PDF is not 
feasible.
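
For a sense of scale, the figures quoted above can be checked with a few lines of Python. Only the four counts come from this message; the variable names are arbitrary.

```python
# Counts quoted in the message: characters and lines before and after.
ps_chars, ps_lines = 361091, 7268
txt_chars, txt_lines = 1799, 30

kept = txt_chars / ps_chars
print(f"PS file:   {ps_chars / 1024:.1f} KB, {ps_lines} lines")
print(f"Text file: {txt_chars / 1024:.2f} KB, {txt_lines} lines")
print(f"Characters kept: {kept:.2%} (about 1 in {ps_chars // txt_chars})")
```

In other words, roughly half a percent of the print file was actual text content in that test.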

Finally, readers who wish to extract text from a PDF should check to see if the 
text can be copied and pasted directly from the PDF viewer or if the viewer has 
a built-in means to export the text content.

Best regards,
Bill Segraves
-------------- Original message from Leonard Rosenthol <[EMAIL PROTECTED]>: 
-------------- 


Since text extraction from PDF (and PostScript) REQUIRES that you parse the 
file format in question in order to gather up font encoding information (since 
in most cases the text won't appear as readable text anyway), that also means 
that you have to decode (and/or decrypt) any data streams - whether for 
printing or simple extraction.
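
A small Python sketch of why decoding has to come first: once a content stream is Flate-compressed (zlib), the text operators are no longer visible to a regex. The content stream below is made up, and a real PDF wraps such streams in objects with filter entries and possibly encryption, none of which is modeled here.

```python
import re
import zlib

# Byte-oriented pattern for "(text)Tj" / "(text)TJ" without nested parentheses.
FRAGMENT = re.compile(rb"\(([^()]*)\)\s*T[jJ]\b")

# Made-up, uncompressed page content with three text-showing operations.
content = (
    b"BT /F1 12 Tf 72 720 Td (Hello, world)Tj "
    b"0 -14 Td (Hello again, world)Tj "
    b"0 -14 Td (Goodbye, world)Tj ET"
)

compressed = zlib.compress(content)  # stands in for a Flate-encoded stream

# Searching the compressed bytes is searching binary noise.
print(FRAGMENT.findall(compressed))

# After decoding, the same pattern finds the fragments again.
print(FRAGMENT.findall(zlib.decompress(compressed)))
```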


Application of regex on PDF (or PostScript) content is not only complex in and 
of itself, given the font encoding situations, BUT (as you know) is also made 
complex by the "splitting up" of content across multiple operations and 
potentially not in logical order.
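
One common form of this "splitting up" is a kerned TJ array, where a single word is spread over several string operands. The Python sketch below is only an illustration under simplifying assumptions: the sample line is invented, the helper join_tj_array is hypothetical, and a "]" inside a string operand would confuse it.

```python
import re

# Simple one-string pattern, as in "(text)Tj".
SIMPLE = re.compile(r"\(([^()]*)\)\s*T[jJ]\b")

# Array form: [ (string) number (string) ... ] TJ
ARRAY_TJ = re.compile(r"\[(.*?)\]\s*TJ\b")
STRING = re.compile(r"\(((?:[^\\()]|\\.)*)\)")

def join_tj_array(line):
    """Concatenate the string operands of each [...] TJ array on a line."""
    return ["".join(STRING.findall(body)) for body in ARRAY_TJ.findall(line)]

# Made-up example: one phrase split into three pieces with kerning offsets.
line = "[(Hel) -20 (lo, w) 15 (orld)] TJ"

print(SIMPLE.findall(line))    # the simple pattern sees nothing usable
print(join_tj_array(line))     # the pieces have to be reassembled
```

Even with the pieces joined, font encoding can still make the recovered bytes unreadable, which is the other half of the objection.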


Leonard




On Oct 7, 2007, at 9:04 PM, William A. Segraves wrote:


Good question, Leonard.
If someone were to attempt to use the approach I recommended to extract text 
from a locked and encrypted PDF, rather than from the PS print file, he/she 
would be doomed to failure, as the regex engine that would be used in the 
match/substitution would be unable to find the target text fragments.
The context of the OP's question suggested he was trying to read text content 
from a PDF with a program. Naturally, the procedure I recommended requires the 
user to be able to open the PDF with Reader and print it to a PS file.
I don't understand your question about "read it to just read it" in the context 
of the OP's question. I think we're talking about text extraction. Please 
rephrase your question.
BTW, this issue was thoroughly discussed on comp.text.pdf about 4-5 years ago.
Cheers,
Bill Segraves
----- Original Message ----- 
From: Leonard Rosenthol 
To: Post all your questions about iText here 
Sent: Sunday, October 07, 2007 6:57 PM
Subject: Re: [iText-questions] u guys konw how to read the data from pdf using java itext ?


Why would using this approach fail on a PDF that is locked with a 
password? In order to print the PDF, you have to have the ability to 
READ it ;). And if you can read it in order to print it, you can 
read it to just read it...Right?

Leonard


On Oct 7, 2007, at 11:05 AM, [EMAIL PROTECTED] wrote:

> If the PDF is locked with a password, but still printable, the 
> approach offered by this author is one that would work, while 
> attempting to use this approach on the original PDF would fail. 
> This author was simply trying to help the poster with an approach 
> that would avoid the frustration that would ensue if he tried to 
> work with an original locked PDF.
>
>
> Of course, the approach espoused by the esteemed sage would be 
> easier, for both locked and unlocked PDFs. OTOH, this author 
> doesn't count easier to fail as an acceptable approach.
>
>
> Cheers,
>
> Bill Segraves
>
> -------------- Original message from Leonard Rosenthol 
> <[EMAIL PROTECTED]>: --------------
>
>
> > Why would working through the PostScript be easier than doing this on
> > the original PDF?
> >
> > You can get to all the PDF operators just fine.
> > Font & text information is more easily referenceable from the PDF
> > PostScript also has "XObjects", Patterns, etc. that may contain text.
> > etc.
> >
> > Not understanding the logic :(.
> >
> > Leonard
> >
> >
> > On Oct 6, 2007, at 4:53 PM, [EMAIL PROTECTED] wrote:
> >
> > > Yes; but it is not practicable with iText. You could, however, as
> > > long as the PDF is printable, use the following procedure:
> > >
> > > 1. Print to a PS file.
> > >
> > > 2. Scan the PS file from step 1 above, dropping all lines that
> > > do not end with Tj or TJ.
> > >
> > > 3. Use a regular expression (together with Substitution or
> > > Match) to extract the instances of "text fragment" from within
> > > multiple instances of "(text fragment)Tj", printing the resulting
> > > text fragments to STDOUT.
> > >
> > > Bruno has given an excellent example of why you should not expect
> > > the resulting output to make sense, i.e., the text fragments may
> > > not appear in the order in which you'd like for them to appear.
> > >
> > > Cheers,
> > >
> > > Bill Segraves
> > >
> > > -------------- Original message from krammark
> > > : --------------
> > >
> > >
> > > >
> > > > so , how we read the data from pdf ?
> > > > i mean , can we read them line by line from the specific pages ?
> > > > >
> > > > thanks buddy.
> > > >
> > > >
> > > > Bruno Lowagie (iText) wrote:
> > > > >
> > > > > krammark wrote:
> > > > >> hey guys,
> > > > >> do u guys have a idea how to read the data from pdf pages using itext ?
> > > > >> basically, i want to read the data from table and write them into excel
> > > > >> files.
> > > > >> is that possible ?
> > > > >
> > > > > There is no such thing as 'a table' in plain PDF.
> > > > > It's just lines and words painted on a canvas,
> > > > > > possibly in an arbitrary order.
> > > > >
> > > > > > Unless your table cells are form fields, or your
> > > > > > PDF contains specific table structures (Tagged PDF),
> > > > > > iText probably won't help you.
> > > > > >
> > > > > br,
> > > > > Bruno
> > > > >
> > > > >
> > > > > > -------------------------------------------------------------------------
> > > > > > This SF.net email is sponsored by: Splunk Inc.
> > > > > > Still grepping through log files to find problems? Stop.
> > > > > > Now Search log events and configuration files using AJAX and a browser.
> > > > > > Download your FREE copy of Splunk now >> http://get.splunk.com/
> > > > > _______________________________________________
> > > > > iText-questions mailing list
> > > > > iText-questions@lists.sourceforge.net
> > > > > https://lists.sourceforge.net/lists/listinfo/itext-questions
> > > > > Buy the iText book: http://itext.ugent.be/itext-in-action/
> > > > >
> > > > >
> > > >
> > > > --
> > > > View this message in context:
> > > > http://www.nabble.com/u-guys-konw-how-to-read-the-data-from-pdf-using-java-itext---tf4572506.html#a13067937
> > > > Sent from the iText - General mailing list archive at Nabble.com.
> >
> >

