Oops! I forgot to respond to the second paragraph in your response.
In the second paragraph, it seems you are arguing too strongly for your own
limitations.
The procedure I outlined involves only rudimentary parsing, matching,
substitution, and printing of text fragments to STDOUT, in the order in which
they are found. What could be simpler than parsing lines for instances of
(some text here)Tj
or
(some more text here)TJ
and on finding them, extracting the embedded text fragments, and printing them
to STDOUT? My Perl script was only a few lines of code. I expect I could easily
rewrite it in Java for incorporation in an iText tool. I know I could write the
script as a shell script. In fact, I think it could be written as a one-liner.
In summary, the procedure I outlined could be coded very simply in almost any
programmer's favorite scripting language, provided the language has a built-in
or otherwise available regex engine.
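As a rough illustration of how little code this takes, here is a minimal
sketch in Python (it is not my original Perl script). It assumes each text
operator sits on a line of its own as a literal string, e.g.
"(some text here)Tj", with no escaped or nested parentheses. The function name
extract_fragments and the sample lines are made up for the example.

```python
import re

# Matches "(text)Tj" or "(text)TJ" at the end of a line; group 1 is the text.
# Assumption: no nested or escaped parentheses inside the fragment.
FRAGMENT = re.compile(r"\((.*)\)\s*T[jJ]\s*$")


def extract_fragments(lines):
    """Return the text fragments found in lines ending with Tj or TJ."""
    fragments = []
    for line in lines:
        match = FRAGMENT.search(line)
        if match:
            fragments.append(match.group(1))
    return fragments


if __name__ == "__main__":
    # Hypothetical sample input; a real run would read the PS print file.
    sample = [
        "/F1 12 Tf",
        "(some text here)Tj",
        "0 -14 Td",
        "(some more text here)TJ",
    ]
    for fragment in extract_fragments(sample):
        print(fragment)
```

Lines that do not end with Tj or TJ are simply skipped, which corresponds to
the "drop all other lines" step of the procedure.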
The "pattern" can be as robust as you care to make it, i.e., characters only,
characters + punctuation marks, etc.
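To illustrate what a somewhat more robust pattern might look like, the sketch
below (in Python, and an assumption on my part rather than the script I used)
accepts backslash-escaped parentheses inside a fragment and also picks the
string pieces out of the array form "[(Hel) -20 (lo)] TJ" that PDF content
streams use with TJ. The names STRING, SHOW_TEXT, UNESCAPE, and fragments_in
are hypothetical. It still does nothing about font encodings or hex strings.

```python
import re

# A literal string: "(" ... ")" in which backslash escapes such as \( \) \\
# are allowed. Assumption: no unescaped nested parentheses.
STRING = re.compile(r"\(((?:\\.|[^\\()])*)\)")

# A line that ends with one of the text-showing operators.
SHOW_TEXT = re.compile(r"T[jJ]\s*$")

# Only the three simplest escapes are undone here: \( \) and \\.
UNESCAPE = re.compile(r"\\([()\\])")


def fragments_in(line):
    """Return the joined string pieces of one Tj/TJ line, or None otherwise."""
    if not SHOW_TEXT.search(line):
        return None
    pieces = STRING.findall(line)
    return "".join(UNESCAPE.sub(r"\1", piece) for piece in pieces)


if __name__ == "__main__":
    # Hypothetical sample lines for illustration only.
    for line in ["(some text here)Tj", r"[(Hel) -20 (lo \(world\))] TJ", "0 -14 Td"]:
        text = fragments_in(line)
        if text is not None:
            print(text)
```

The kerning numbers between the pieces of a TJ array are ignored, so a large
negative adjustment that visually acts as a word space is lost in this sketch.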
Bruno and I both cautioned readers carefully about the possibility of text
fragments not appearing in the correct order, i.e., the order in which they can
be seen in the displayed PDF, together with reasons why this is so. The reasons
are well known, so no further discussion on "order" is necessary or desired.
The procedure I outlined was developed and tested about 5 years ago. In one
test, I extracted the text from a part of the Ghostscript User Manual. The PS
(.prn, in my case) file was 353KB (7268L, 361091C); the resulting plain text
file was 1.75KB (30L, 1799C). Clearly, the procedure I
outlined has potential for saving a lot of time for someone who needs to
extract text content from a PDF, whether the text is extracted directly from a
PDF file, when feasible, or from a PS print file, when extraction from the PDF
is not feasible.
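To put the numbers above in perspective, the reduction can be worked out from
the two character counts quoted (361091 and 1799). Nothing else is assumed in
this small Python calculation; the variable names are my own.

```python
ps_chars = 361091    # size of the PS (.prn) print file, in characters
text_chars = 1799    # size of the extracted plain text, in characters

ratio = ps_chars / text_chars          # how many times smaller the output is
percent = 100 * text_chars / ps_chars  # output size as a percentage of the input

print(f"about {ratio:.0f}x smaller ({percent:.2f}% of the original)")
```

In other words, the extracted text is roughly 200 times smaller than the print
file, or about half a percent of its size.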
Finally, readers who wish to extract text from a PDF should check to see if the
text can be copied and pasted directly from the PDF viewer or if the viewer has
a built-in means to export the text content.
Best regards,
Bill Segraves
-------------- Original message from Leonard Rosenthol <[EMAIL PROTECTED]>: --------------
Since text extraction from PDF (and PostScript) REQUIRES that you parse the
file format in question in order to gather up font encoding information (since
in most cases the text won't appear as readable text anyway), that also means
that you have to decode (and/or decrypt) any data streams - whether for
printing or simple extraction.
Applying a regex to PDF (or PostScript) content is not only complex in and
of itself, given the font encoding situations, BUT (as you know) is also made
complex by the "splitting up" of content across multiple operations,
potentially not in logical order.
Leonard
On Oct 7, 2007, at 9:04 PM, William A. Segraves wrote:
Good question, Leonard.
If someone were to attempt to use the approach I recommended to extract text
from a locked and encrypted PDF, rather than from the PS print file, he/she
would be doomed to failure, as the regex engine that would be used in the
match/substitution would be unable to find the target text fragments.
The context of the OP's question suggested he was trying to read text content
from a PDF with a program. Naturally, the procedure I recommended requires the
user to be able to open the PDF with Reader and print it to a PS file.
I don't understand your question about "read it to just read it" in the context
of the OP's question. I think we're talking about text extraction. Please
rephrase your question.
BTW, this issue was thoroughly discussed on comp.text.pdf about 4-5 years ago.
Cheers,
Bill Segraves
----- Original Message -----
From: Leonard Rosenthol
To: Post all your questions about iText here
Sent: Sunday, October 07, 2007 6:57 PM
Subject: Re: [iText-questions] u guys konw how to read the data from pdf using java itext ?
Why would using this approach fail on a PDF that is locked with a
password? In order to print the PDF, you have to have the ability to
READ it ;). And if you can read it in order to print it, you can
read it to just read it...Right?
Leonard
On Oct 7, 2007, at 11:05 AM, [EMAIL PROTECTED] wrote:
> If the PDF is locked with a password, but still printable, the
> approach offered by this author is one that would work, while
> attempting to use this approach on the original PDF would fail.
> This author was simply trying to help the poster with an approach
> that would avoid the frustration that would ensue if he tried to
> work with an original locked PDF.
>
>
> Of course, the approach espoused by the esteemed sage would be
> easier, for both locked and unlocked PDFs. OTOH, this author
> doesn't count easier to fail as an acceptable approach.
>
>
> Cheers,
>
> Bill Segraves
>
> -------------- Original message from Leonard Rosenthol <[EMAIL PROTECTED]>: --------------
>
>
> > Why would working through the PostScript be easier than doing this on
> > the original PDF?
> >
> > You can get to all the PDF operators just fine.
> > Font & text information is more easily referenceable from the PDF.
> > PostScript also has "XObjects", Patterns, etc. that may contain text.
> > etc.
> >
> > Not understanding the logic :(.
> >
> > Leonard
> >
> >
> > On Oct 6, 2007, at 4:53 PM, [EMAIL PROTECTED] wrote:
> >
> > > Yes; but it is not practicable with iText. You could, however, as
> > > long as the PDF is printable, use the following procedure:
> > >
> > > 1. Print to a PS file.
> > >
> > > 2. Scan the PS file from step 1 above, dropping all lines that
> > > do not end with Tj or TJ.
> > >
> > > 3. Use a regular expression (together with Substitution or
> > > Match) to extract the instances of "text fragment" from within
> > > multiple instances of "(text fragment)Tj", printing the resulting
> > > text fragments to STDOUT.
> > >
> > > Bruno has given an excellent example of why you should not expect
> > > the resulting output to make sense, i.e., the text fragments may
> > > not appear in the order in which you'd like for them to appear.
> > >
> > > Cheers,
> > >
> > > Bill Segraves
> > >
> > > -------------- Original message from krammark: --------------
> > >
> > >
> > > >
> > > > so , how we read the data from pdf ?
> > > > i mean , can we read them line by line from the specific pages ?
> > > >
> > > > thanks buddy.
> > > >
> > > >
> > > > Bruno Lowagie (iText) wrote:
> > > > >
> > > > > krammark wrote:
> > > > > >> hey guys,
> > > > > >> do u guys have a idea how to read the data from pdf pages using itext ?
> > > > > >> basically, i want to read the data from table and write them into excel
> > > > > >> files.
> > > > > >> is that possible ?
> > > > >
> > > > > There is no such thing as 'a table' in plain PDF.
> > > > > It's just lines and words painted on a canvas,
> > > > > > possibly in an arbitrary order.
> > > > > >
> > > > > > Unless your table cells are form fields, or your
> > > > > > PDF contains specific table structures (Tagged PDF),
> > > > > > iText probably won't help you.
> > > > > >
> > > > > br,
> > > > > Bruno
> > > > >
> > > > >
> > > > > -------------------------------------------------------------------------
> > > > > This SF.net email is sponsored by: Splunk Inc.
> > > > > Still grepping through log files to find problems? Stop.
> > > > > Now Search log events and configuration files using AJAX and a browser.
> > > > > Download your FREE copy of Splunk now >> http://get.splunk.com/
> > > > > _______________________________________________
> > > > > iText-questions mailing list
> > > > > iText-questions@lists.sourceforge.net
> > > > > https://lists.sourceforge.net/lists/listinfo/itext-questions
> > > > > Buy the iText book: http://itext.ugent.be/itext-in-action/
> > > > >
> > > > >
> > > >
> > > > --
> > > > View this message in context:
> > > > http://www.nabble.com/u-guys-konw-how-to-read-the-data-from-pdf-using-java-itext---tf4572506.html#a13067937
> > > > Sent from the iText - General mailing list archive at Nabble.com.
> > > >
> > > >
> > > >