Hello,
I am sorry if I am digressing. I have a bunch of PDF files in which I just
want to blank out the text. Reading through this thread, I got the idea of
replacing every text string by changing
(SOME TEXT) Tj to ( ) Tj
But when I converted my PDF file to PS using pdf2ps, I did not see any
such structures. Still, the PS file could be viewed in the Ghostscript viewer.
I don't know if my PDF is encoded in some way. Is there any other way text
is rendered in PS?
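A minimal sketch of that substitution in Python, for illustration only. It assumes the page content is already available as uncompressed bytes, and it handles only literal strings in parentheses followed by Tj (not TJ arrays, and not hex strings such as <48656C6C6F> Tj). The names TJ_PATTERN and blank_tj_strings are made up for this example and are not part of iText or Ghostscript.

```python
import re

# A literal string "(...)" followed by the Tj operator.
# Backslash escapes such as \( and \) are allowed inside the string;
# nested unescaped parentheses are NOT handled by this pattern.
TJ_PATTERN = re.compile(rb"\((?:\\.|[^\\()])*\)(\s*)Tj\b", re.DOTALL)


def blank_tj_strings(content: bytes) -> bytes:
    """Replace every `(SOME TEXT) Tj` with `( ) Tj`, keeping the spacing before Tj."""
    return TJ_PATTERN.sub(rb"( )\g<1>Tj", content)


# Hypothetical sample input, not taken from the attached files.
sample = b"BT /F1 12 Tf (SOME TEXT) Tj ET"
print(blank_tj_strings(sample))  # b'BT /F1 12 Tf ( ) Tj ET'
```

If the pattern matches nothing in a real file, the content may be compressed, or the text may be written as hex strings or through other operators, in which case a plain substitution like this will not find it.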
I have attached my original PDF and converted PS file for your reference.
Sorry if it sounds very naive. I am pretty new to this whole thing.
Thank you,
Sarath
On 10/8/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>
> Oops! I forgot to respond to the second paragraph in your response.
>
>
>
> In the second paragraph, it seems you are arguing too strongly for your
> own limitations.
>
>
>
> The procedure I outlined involves only rudimentary parsing, matching, and
> substitution, and printing of text fragments to STDOUT, in the order in
> which they are found. What could be simpler than parsing lines for instances
> of
>
>
>
> (some text here)Tj
>
>
>
> or
>
>
>
> (some more text here)TJ
>
>
>
> and on finding them, extracting the embedded text fragments, and printing
> them to STDOUT? My Perl script was only a few lines of code. I expect I
> could easily rewrite it in Java for incorporation in an iText tool. I know I
> could write the script as a shell script. In fact, I think it could be
> written as a one-liner. In summary, the procedure I outlined could be coded
> very simply in almost any programmer's favorite scripting language, provided
> the language has a built-in or otherwise available regex engine.
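The procedure described above might look roughly like this in Python. This is a sketch under assumptions, not the Perl script mentioned above, which was not posted in this thread. It assumes the PS or PDF content is uncompressed text with one text-showing operator per line, and the names STRING and extract_fragments are invented for this example.

```python
import re

# One literal string "(...)"; group 1 captures the text inside the parentheses.
# Backslash escapes are allowed and left as-is; nested unescaped parentheses
# are not handled.
STRING = re.compile(r"\(((?:\\.|[^\\()])*)\)")


def extract_fragments(lines):
    """Return the text fragments from lines that end with Tj or TJ, in file order."""
    fragments = []
    for line in lines:
        stripped = line.rstrip()
        # Drop every line that does not end with a text-showing operator.
        if not stripped.endswith(("Tj", "TJ")):
            continue
        fragments.extend(STRING.findall(stripped))
    return fragments


# Hypothetical sample lines standing in for a PS print file.
sample = ["0 0 moveto", "(Hello)Tj", "[(Wor) -20 (ld)] TJ", "showpage"]
for fragment in extract_fragments(sample):
    print(fragment)
```

The fragments come out in the order they occur in the file, which is not necessarily the order in which they appear on the page.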
>
>
>
> The "pattern" can be as robust as you care to make it, i.e., characters
> only, characters + punctuation marks, etc.
>
>
>
> Bruno and I both cautioned readers carefully about the possibility of text
> fragments not appearing in the correct order, i.e., the order in which
> they can be seen in the displayed PDF, together with reasons why this is so.
> The reasons are well known, so no further discussion on "order" is necessary
> or desired.
>
>
>
> The procedure I outlined was developed and tested about 5 years ago.
> Extracting the text from a part of the Ghostscript User Manual, for which
> the PS (.prn, in my case) file was 353KB (7268L, 361091C), resulted in a
> plain text file that was 1.75KB (30L, 1799C). Clearly, the
> procedure I outlined has potential for saving a lot of time for someone who
> needs to extract text content from a PDF, whether the text is extracted
> directly from a PDF file, when feasible, or from a PS print file, when
> extraction from the PDF is not feasible.
>
>
>
> Finally, readers who wish to extract text from a PDF should check to see
> if the text can be copied and pasted directly from the PDF viewer or if the
> viewer has a built-in means to export the text content.
>
>
>
> Best regards,
>
> Bill Segraves
>
> -------------- Original message from Leonard Rosenthol <
> [EMAIL PROTECTED]>: --------------
>
> Since text extraction from PDF (and PostScript) REQUIRES that you parse
> the file format in question in order to gather up font encoding information
> (since in most cases the text won't appear as readable text anyway), that
> also means that you have to decode (and/or decrypt) any data streams -
> whether for printing or simple extraction.
>
>
> Application of regex on PDF (or PostScript) content is not only complex in
> and of itself, given the font encoding situations BUT (as you know) also
> made complex by the "splitting up" of content across multiple operations and
> potentially not in logical order.
>
>
> Leonard
>
>
>
> On Oct 7, 2007, at 9:04 PM, William A. Segraves wrote:
>
> Good question, Leonard.
> If someone were to attempt to use the approach I recommended to extract
> text from a locked and encrypted PDF, rather than from the PS print file,
> he/she would be doomed to failure, as the regex engine that would be used in
> the match/substitution would be unable to find the target text fragments.
> The context of the OP's question suggested he was trying to read text
> content from a PDF with a program. Naturally, the procedure I recommended
> requires the user to be able to open the PDF with Reader and print it to a
> PS file.
> I don't understand your question about "read it to just read it" in the
> context of the OP's question. I think we're talking about text extraction.
> Please rephrase your question.
> BTW, this issue was thoroughly discussed on comp.text.pdf about 4-5 years
> ago.
> Cheers,
> Bill Segraves
>
> ----- Original Message -----
> *From:* Leonard Rosenthol <[EMAIL PROTECTED]>
> *To:* Post all your questions about iText
> here<itext-questions@lists.sourceforge.net>
> *Sent:* Sunday, October 07, 2007 6:57 PM
> *Subject:* Re: [iText-questions] u guys konw how to read the data from
> pdf using java itext ?
>
>
> Why would using this approach fail on a PDF that is locked with a
> password? In order to print the PDF, you have to have the ability to
> READ it ;). And if you can read it in order to print it, you can
> read it to just read it...Right?
>
> Leonard
>
>
> On Oct 7, 2007, at 11:05 AM, [EMAIL PROTECTED] wrote:
>
> > If the PDF is locked with a password, but still printable, the
> > approach offered by this author is one that would work, while
> > attempting to use this approach on the original PDF would fail.
> > This author was simply trying to help the poster with an approach
> > that would avoid the frustration that would ensue if he tried to
> > work with an original locked PDF.
> >
> >
> > Of course, the approach espoused by the esteemed sage would be
> > easier, for both locked and unlocked PDFs. OTOH, this author
> > doesn't count easier to fail as an acceptable approach.
> >
> >
> > Cheers,
> >
> > Bill Segraves
> >
> > -------------- Original message from Leonard Rosenthol
> > <[EMAIL PROTECTED]>: --------------
> >
> >
> > > Why would working through the PostScript be easier than doing this on
> > > the original PDF?
> > >
> > > You can get to all the PDF operators just fine.
> > > Font & text information is more easily referenceable from the PDF
> > > PostScript also has "XObjects", Patterns, etc. that may contain text.
> > > etc.
> > >
> > > Not understanding the logic :(.
> > >
> > > Leonard
> > >
> > >
> > > On Oct 6, 2007, at 4:53 PM, [EMAIL PROTECTED] wrote:
> > >
> > > > Yes; but it is not practicable with iText. You could, however, as
> > > > long as the PDF is printable, use the following procedure:
> > > >
> > > > 1. Print to a PS file.
> > > >
> > > > 2. Scan the PS file from step 1 above, dropping all lines that
> > > > do not end with Tj or TJ.
> > > >
> > > > 3. Use a regular expression (together with Substitution or
> > > > Match) to extract the instances of "text fragment" from within
> > > > multiple instances of "(text fragment)Tj", printing the resulting
> > > > text fragments to STDOUT.
> > > >
> > > > Bruno has given an excellent example of why you should not expect
> > > > the resulting output to make sense, i.e., the text fragments may
> > > > not appear in the order in which you'd like for them to appear.
> > > >
> > > > Cheers,
> > > >
> > > > Bill Segraves
> > > >
> > > > -------------- Original message from krammark: --------------
> > > >
> > > >
> > > > >
> > > > > so , how we read the data from pdf ?
> > > > > i mean , can we read them line by line from the specific pages ?
> > > > >
> > > > > thanks buddy.
> > > > >
> > > > >
> > > > > Bruno Lowagie (iText) wrote:
> > > > > >
> > > > > > krammark wrote:
> > > > > >> hey guys,
> > > > > >> do u guys have a idea how to read the data from pdf pages using itext ?
> > > > > >> basically, i want to read the data from table and write them into excel
> > > > > >> files.
> > > > > >> is that possible ?
> > > > > >
> > > > > > There is no such thing as 'a table' in plain PDF.
> > > > > > It's just lines and words painted on a canvas,
> > > > > > possibly in an arbitrary order.
> > > > > >
> > > > > > Unless your table cells are form fields, or your
> > > > > > PDF contains specific table structures (Tagged PDF),
> > > > > > iText probably won't help you.
> > > > > >
> > > > > > br,
> > > > > > Bruno
> > > > > >
> > > > > >
> > > > > > -------------------------------------------------------------------------
> > > > > > This SF.net email is sponsored by: Splunk Inc.
> > > > > > Still grepping through log files to find problems? Stop.
> > > > > > Now Search log events and configuration files using AJAX and a browser.
> > > > > > Download your FREE copy of Splunk now >> http://get.splunk.com/
> > > > > > _______________________________________________
> > > > > > iText-questions mailing list
> > > > > > iText-questions@lists.sourceforge.net
> > > > > > https://lists.sourceforge.net/lists/listinfo/itext-questions
> > > > > > Buy the iText book: http://itext.ugent.be/itext-in-action/
> > > > > >
> > > > > >
> > > > >
> > > > > --
> > > > > View this message in context:
> > > > > http://www.nabble.com/u-guys-konw-how-to-read-the-data-from-pdf-using-java-itext---tf4572506.html#a13067937
> > > > > Sent from the iText - General mailing list archive at Nabble.com <http://nabble.com/>.
> > > > >
> > > > >
> > > > >
> > >
> > >
> > >
>
>