Leonard, your response, while partially correct, is grossly misleading.
My procedure does not require gathering up any font encoding information at
all, as all I'm doing is extracting the raw plain text fragments in the order
in which they appear in the PS file. No decoding/decryption of any data streams
is required in my procedure, as the PDF reader/viewer does that for me.
Printing has nothing to do with the issue, except that in my procedure, I've
used printing to PS as the means to convert the possibly encrypted PDF into
scannable plain text in a PS file.
You were correct in inferring that my procedure could be applied to a PDF file,
thus skipping the print to PS step, but only for PDFs that are not encrypted.
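For readers who want to try the scan, here is a minimal sketch in Java (the language the OP asked about). It is an illustration under stated assumptions, not a tested tool: it assumes the printed PS file shows text on lines of the form "(text fragment)Tj" or "[(frag) -20 (ment)]TJ", keeps only lines that end in Tj or TJ, and pulls out every parenthesized string on those lines. The names PsTextScan and extractFragments are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText or any other library.
final class PsTextScan {
    // A literal string "(...)"; escaped characters such as \( and \) are
    // allowed inside, unescaped nested parentheses are not handled.
    private static final Pattern LITERAL =
        Pattern.compile("\\(((?:\\\\.|[^\\\\()])*)\\)");

    static List<String> extractFragments(List<String> psLines) {
        List<String> fragments = new ArrayList<>();
        for (String line : psLines) {
            String trimmed = line.trim();
            // Step 2: drop every line that does not end with Tj or TJ.
            if (!(trimmed.endsWith("Tj") || trimmed.endsWith("TJ"))) {
                continue;
            }
            // Step 3: collect each "(text fragment)" on the retained line.
            Matcher m = LITERAL.matcher(trimmed);
            while (m.find()) {
                fragments.add(m.group(1)); // escapes are left as-is
            }
        }
        return fragments;
    }
}
```

Fragments come back in file order with escapes such as \( still in place, so the output may well not read in logical order.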
Finally, I get the impression you're engaging in a debate that you think/hope
you can win. This is one where there's no win possible, so I'd prefer you just
drop it.
Best regards,
Bill Segraves
-------------- Original message from Leonard Rosenthol <[EMAIL PROTECTED]>:
--------------
Since text extraction from PDF (and PostScript) REQUIRES that you parse the
file format in question in order to gather up font encoding information (since
in most cases the text won't appear as readable text anyway), that also means
that you have to decode (and/or decrypt) any data streams - whether for
printing or simple extraction.
Application of regex on PDF (or PostScript) content is not only complex in and
of itself, given the font encoding situations BUT (as you know) also made
complex by the "splitting up" of content across multiple operations and
potentially not in logical order.
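To illustrate the "splitting up" point: in a PDF content stream, one word can be spread over several strings inside a single TJ array, separated by numeric position adjustments. The rough Java sketch below joins such pieces and guesses at word gaps. It assumes literal (not hex) strings, a font whose character codes happen to be readable text, and an arbitrary threshold (WORD_GAP) chosen only for this example; the class name TjArrayJoiner is likewise made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class TjArrayJoiner {
    // Group 1: a literal string "(...)"; group 2: a numeric adjustment.
    private static final Pattern ELEMENT =
        Pattern.compile("\\(((?:\\\\.|[^\\\\()])*)\\)|(-?\\d+(?:\\.\\d+)?)");

    // Assumed threshold: an adjustment at or below this value is treated
    // as a word gap (large negative values widen the spacing).
    static final double WORD_GAP = -200.0;

    static String join(String tjArrayBody) {
        StringBuilder text = new StringBuilder();
        Matcher m = ELEMENT.matcher(tjArrayBody);
        while (m.find()) {
            if (m.group(1) != null) {
                text.append(m.group(1));           // a piece of text
            } else if (Double.parseDouble(m.group(2)) <= WORD_GAP) {
                text.append(' ');                  // large gap: guess a space
            }
            // small adjustments are kerning and are ignored
        }
        return text.toString();
    }
}
```

For example, join("[(Hel) -20 (lo) -300 (world)]") yields "Hello world". Nothing here deals with font encodings, which is the harder part of the problem.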
Leonard
On Oct 7, 2007, at 9:04 PM, William A. Segraves wrote:
Good question, Leonard.
If someone were to attempt to use the approach I recommended to extract text
from a locked and encrypted PDF, rather than from the PS print file, he/she
would be doomed to failure, as the regex engine that would be used in the
match/substitution would be unable to find the target text fragments.
The context of the OP's question suggested he was trying to read text content
from a PDF with a program. Naturally, the procedure I recommended requires the
user to be able to open the PDF with Reader and print it to a PS file.
I don't understand your question about "read it to just read it" in the context
of the OP's question. I think we're talking about text extraction. Please
rephrase your question.
BTW, this issue was thoroughly discussed on comp.text.pdf about 4-5 years ago.
Cheers,
Bill Segraves
----- Original Message -----
From: Leonard Rosenthol
To: Post all your questions about iText here
Sent: Sunday, October 07, 2007 6:57 PM
Subject: Re: [iText-questions] u guys konw how to read the data from pdf using
java itext ?
Why would using this approach fail on a PDF that is locked with a
password? In order to print the PDF, you have to have the ability to
READ it ;). And if you can read it in order to print it, you can
read it to just read it...Right?
Leonard
On Oct 7, 2007, at 11:05 AM, [EMAIL PROTECTED] wrote:
> If the PDF is locked with a password, but still printable, the
> approach offered by this author is one that would work, while
> attempting to use this approach on the original PDF would fail.
> This author was simply trying to help the poster with an approach
> that would avoid the frustration that would ensue if he tried to
> work with an original locked PDF.
>
>
> Of course, the approach espoused by the esteemed sage would be
> easier, for both locked and unlocked PDFs. OTOH, this author
> doesn't count easier to fail as an acceptable approach.
>
>
> Cheers,
>
> Bill Segraves
>
> -------------- Original message from Leonard Rosenthol
> <[EMAIL PROTECTED]>: --------------
>
>
> > Why would working through the PostScript be easier than doing this on
> > the original PDF?
> >
> > You can get to all the PDF operators just fine.
> > Font & text information is more easily referenceable from the PDF.
> > PostScript also has "XObjects", Patterns, etc. that may contain text.
> > etc.
> >
> > Not understanding the logic :(.
> >
> > Leonard
> >
> >
> > On Oct 6, 2007, at 4:53 PM, [EMAIL PROTECTED] wrote:
> >
> > > Yes; but it is not practicable with iText. You could, however, as
> > > long as the PDF is printable, use the following procedure:
> > >
> > > 1. Print to a PS file.
> > >
> > > 2. Scan the PS file from step 1 above, dropping all lines that
> > > do not end with Tj or TJ.
> > >
> > > 3. Use a regular expression (together with Substitution or
> > > Match) to extract the instances of "text fragment" from within
> > > multiple instances of "(text fragment)Tj", printing the resulting
> > > text fragments to STDOUT.
> > >
> > > Bruno has given an excellent example of why you should not expect
> > > the resulting output to make sense, i.e., the text fragments may
> > > not appear in the order in which you'd like for them to appear.
> > >
> > > Cheers,
> > >
> > > Bill Segraves
> > >
> > > -------------- Original message from krammark
> > > : --------------
> > >
> > >
> > > >
> > > > so , how we read the data from pdf ?
> > > > i mean , can we read them line by line from the specific pages ?
> > > > >
> > > > thanks buddy.
> > > >
> > > >
> > > > Bruno Lowagie (iText) wrote:
> > > > >
> > > > > krammark wrote:
> > > > >> hey gusy,
> > > > >> do u guys have a idea how to read the data from pdf pages using itext ?
> > > > >> basically, i want to read the data from table and write them into excel
> > > > >> files.
> > > > >> is that possible ?
> > > > >
> > > > > There is no such thing as 'a table' in plain PDF.
> > > > > It's just lines and words painted on a canvas,
> > > > > > possibly in an arbitrary order.
> > > > >
> > > > > > Unless your table cells are form fields, or your
> > > > > > PDF contains specific table structures (Tagged PDF),
> > > > > iText probably won't help you.
> > > > > >
> > > > > br,
> > > > > Bruno
> > > > >
> > > > >
> > >
> > > > > > -------------------------------------------------------------------------
> > > > > > This SF.net email is sponsored by: Splunk Inc.
> > > > > > Still grepping through log files to find problems? Stop.
> > > > > > Now Search log events and configuration files using AJAX and a browser.
> > > > > > Download your FREE copy of Splunk now >> http://get.splunk.com/
> > > > > _______________________________________________
> > > > > iText-questions mailing list
> > > > > iText-questions@lists.sourceforge.net
> > > > > https://lists.sourceforge.net/lists/listinfo/itext-questions
> > > > > Buy the iText book: http://itext.ugent.be/itext-in-action/
> > > > >
> > > > >
> > > >
> > > > --
> > > > View this message in context:
> > > > http://www.nabble.com/u-guys-konw-how-to-read-the-data-from-pdf-using-java-itext---tf4572506.html#a13067937
> > > > Sent from the iText - General mailing list archive at Nabble.com.
> > > >
> > > >
> > > >
> > >
> > >
> >
> >
> >