Re: [iText-questions] u guys konw how to read the data frompdfusing java itext ?

Paulo Soares Mon, 08 Oct 2007 09:57:31 -0700

Try this:

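// Imports, assuming the iText 2.x package layout (com.lowagie):
import java.io.FileOutputStream;
import java.util.ArrayList;
import com.lowagie.text.pdf.*;   // ByteBuffer here is iText's, not java.nio's

PdfReader r = new PdfReader("c:\\out-1.pdf");
for (int j = 1; j <= r.getNumberOfPages(); ++j) {
    System.out.println("Page " + j);
    byte[] b = r.getPageContent(j);
    PdfContentParser cp = new PdfContentParser(new PRTokeniser(b));
    ByteBuffer bb = new ByteBuffer();
    ArrayList a = new ArrayList();
    // Each parse() call reads one operation into a: operands first, operator last.
    while (!cp.parse(a).isEmpty()) {
        String cmd = a.get(a.size() - 1).toString();
        // Drop the four text-showing operators together with their operands.
        if (cmd.equals("Tj") || cmd.equals("TJ") || cmd.equals("'")
                || cmd.equals("\""))
            continue;
        // Write every other operation back unchanged.
        for (int k = 0; k < a.size(); ++k) {
            PdfObject obj = (PdfObject)a.get(k);
            obj.toPdf(null, bb);
            bb.append(' ');
        }
        bb.append('\n');
    }
    r.setPageContent(j, bb.toByteArray());
}
// The stamper writes the modified pages to a new file when it is closed.
PdfStamper s = new PdfStamper(r, new FileOutputStream("c:\\no_text.pdf"));
s.close();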


I attach the resulting PDF.

Paulo 

> -----Original Message-----
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On 
> Behalf Of Sarath Dorbala
> Sent: Monday, October 08, 2007 5:48 PM
> To: Post all your questions about iText here
> Subject: Re: [iText-questions] u guys konw how to read the 
> data frompdfusing java itext ?
> 
> My bad, here are the attachments:
> original PDF (out-1.pdf) and converted PS (file.ps) 
> 
>  
> On 10/8/07, Sarath Dorbala <[EMAIL PROTECTED]> wrote: 
> 
>       Hello,
>       I am sorry if I am digressing. I have a bunch of PDF 
> files in which I just want to blank out the text. As I read 
> the threads of this particular post, I got an idea of 
> replacing any text by doing this
>        
>       (SOME TEXT) Tj to (         ) Tj
>        
>       But when I converted my pdf file to ps using pdf2ps, I 
> did not see any of such structures. Still the ps file could 
> be viewed using ghostscript viewer. I don't know if my PDF is 
> encoded in some sense. Is there any other way text is rendered in PS? 
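
The substitution described above can be sketched with plain java.util.regex, without iText. This is only a rough illustration under stated assumptions: the class name BlankTj and the sample content stream are made up, the input is assumed to be an uncompressed content stream, and only literal strings directly followed by Tj are touched. Hex strings, TJ arrays and nested parentheses are left alone.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BlankTj {
    // Group 1: body of a literal string (backslash escapes allowed, no nesting).
    // Group 2: optional whitespace plus the Tj operator, kept as is.
    private static final Pattern SHOW_TEXT =
            Pattern.compile("\\(((?:\\\\.|[^\\\\()])*)\\)(\\s*Tj)");

    // Replaces the body of every "(...) Tj" string with the same number of spaces.
    static String blank(String content) {
        Matcher m = SHOW_TEXT.matcher(content);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char[] pad = new char[m.group(1).length()];
            Arrays.fill(pad, ' ');
            String repl = "(" + new String(pad) + ")" + m.group(2);
            // quoteReplacement keeps '$' and '\' in repl from being interpreted.
            m.appendReplacement(sb, Matcher.quoteReplacement(repl));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical content stream fragment, for illustration only.
        System.out.println(blank("BT /F1 12 Tf (SOME TEXT) Tj ET"));
    }
}
```

The blanked string keeps the raw length of the original, so the rest of the stream is untouched. The snippet at the top of this message takes a different route: it drops the text-showing operators altogether instead of blanking their operands.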
>        
>       I have attached my original PDF and converted PS file 
> for your reference. 
>        
>       Sorry if it sounds very naive. I am pretty new to this 
> whole thing.
>        
>       Thank you,
>       
>       Sarath
>       
>        
>       On 10/8/07, [EMAIL PROTECTED] 
> <mailto:[EMAIL PROTECTED]>  <[EMAIL PROTECTED] > wrote: 
> 
>               
> 
>               Oops! I forgot to respond to the second 
> paragraph in your response.
> 
>                
> 
>               In the second paragraph, it seems you are 
> arguing too strongly for your own limitations.
> 
>                
> 
>               The procedure I outlined involves only 
> rudimentary parsing, matching, and substitution, and printing 
> of text fragments to STDOUT, in the order in which they are 
> found. What could be simpler than parsing lines for instances of 
> 
>                
> 
>                    (some text here)Tj
> 
>                
> 
>               or
> 
>                
> 
>                    (some more text here)TJ
> 
>                
> 
>               and on finding them, extracting the embedded 
> text fragments, and printing them to STDOUT? My Perl script 
> was only a few lines of code. I expect I could easily rewrite 
> it in Java for incorporation in an iText tool. I know I could 
> write the script as a shell script. In fact, I think it could 
> be written as a one-liner. In summary, the procedure I 
> outlined could be coded very simply in almost any programmer's 
> favorite scripting language, provided the language has a 
> built-in or otherwise available regex engine.  
> 
>                
> 
>               The "pattern" can be as robust as you care to 
> make it, i.e., characters only, characters + punctuation marks, etc.
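
A minimal sketch of the match-and-print procedure described above, written in Java rather than Perl. The class name TjFragments and the sample content stream are hypothetical, and the pattern is only one possible choice: it accepts literal strings with backslash escapes followed by Tj, and ignores TJ arrays, hex strings and nested parentheses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TjFragments {
    // A literal string "(...)" followed by the Tj operator. Escapes such as
    // \( and \) are allowed inside; unescaped nested parentheses are not handled.
    private static final Pattern SHOW_TEXT =
            Pattern.compile("\\(((?:\\\\.|[^\\\\()])*)\\)\\s*Tj");

    // Returns the raw (still escaped) string bodies in the order they are found.
    static List<String> extract(String content) {
        List<String> out = new ArrayList<String>();
        Matcher m = SHOW_TEXT.matcher(content);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical content stream, for illustration only.
        String sample = "BT\n/F1 12 Tf\n(Hello) Tj\n0 -14 Td\n(world)Tj\nET\n";
        for (String fragment : extract(sample)) {
            System.out.println(fragment);
        }
    }
}
```

The fragments come out in stream order, which, as already cautioned in this thread, need not be reading order.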
> 
>                
> 
>               Bruno and I both cautioned readers carefully 
> about the possibility of text fragments not appearing in the 
> correct order, i.e., the order in which they can be seen in 
> the displayed PDF, together with reasons why this is so. The 
> reasons are well known, so no further discussion on "order" 
> is necessary or desired. 
> 
>                
> 
>               The procedure I outlined was developed and 
> tested about 5 years ago. Extracting the text from a part of 
> the Ghostscript User Manual, for which the PS (.prn, in my case) 
> file was 353KB (7268L, 361091C), produced a plain text file 
> of 1.75KB (30L, 1799C). 
> Clearly, the procedure I outlined has potential for saving a 
> lot of time for someone who needs to extract text content 
> from a PDF, whether the text is extracted directly from a PDF 
> file, when feasible, or from a PS print file, when extraction 
> from the PDF is not feasible. 
> 
>                
> 
>               Finally, readers who wish to extract text from 
> a PDF should check to see if the text can be copied and 
> pasted directly from the PDF viewer or if the viewer has a 
> built-in means to export the text content.
> 
>                
> 
>               Best regards,
> 
>               Bill Segraves
> 
>                       -------------- Original message from 
> Leonard Rosenthol < [EMAIL PROTECTED] 
> <mailto:[EMAIL PROTECTED]> >: -------------- 
>                       
>                       
>                       Since text extraction from PDF (and 
> PostScript) REQUIRES that you parse the file format in 
> question in order to gather up font encoding information 
> (since in most cases the text won't appear as readable text 
> anyway), that also means that you have to decode (and/or 
> decrypt) any data streams - whether for printing or simple 
> extraction. 
> 
>                        
>                       Application of regex on PDF (or 
> PostScript) content is not only complex in and of itself, 
> given the font encoding situations BUT (as you know) also 
> made complex by the "splitting up" of content across multiple 
> operations and potentially not in logical order. 
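
The point about encodings and split content can be shown with a small self-contained Java sketch. Everything here is illustrative: the class name NaiveRegexLimits is made up, the hex operand is an arbitrary example of glyph codes, and the pattern is the simplest form of the "(text)Tj" match discussed in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveRegexLimits {
    // Naive pattern: a literal string directly followed by the Tj operator.
    private static final Pattern NAIVE = Pattern.compile("\\(([^()]*)\\)\\s*Tj");

    static List<String> naiveExtract(String content) {
        List<String> out = new ArrayList<String>();
        Matcher m = NAIVE.matcher(content);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Readable literal string: the naive pattern finds it.
        System.out.println(naiveExtract("(Hello) Tj"));          // prints [Hello]
        // Hex string of glyph codes: nothing readable, nothing matched.
        System.out.println(naiveExtract("<00030048004F> Tj"));   // prints []
        // One word split across a TJ array with a kerning adjustment.
        System.out.println(naiveExtract("[(He) -20 (llo)] TJ")); // prints []
    }
}
```

Widening the pattern would pick up the TJ pieces, but turning the hex codes back into characters still needs the font's encoding information, which a regex alone does not have.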
> 
>                        
>                       Leonard
> 
>                        
> 
>                       On Oct 7, 2007, at 9:04 PM, William A. 
> Segraves wrote:
> 
> 
>                               Good question, Leonard.
>                               
>                               If someone were to attempt to 
> use the approach I recommended to extract text from a locked 
> and encrypted PDF, rather than from the PS print file, he/she 
> would be doomed to failure, as the regex engine that would be 
> used in the match/substitution would be unable to find the 
> target text fragments. 
>                               
>                               The context of the OP's 
> question suggested he was trying to read text content from a 
> PDF with a program. Naturally, the procedure I recommended 
> requires the user to be able to open the PDF with Reader and 
> print it to a PS file. 
>                               
>                               I don't understand your 
> question about "read it to just read it" in the context of 
> the OP's question. I think we're talking about text 
> extraction. Please rephrase your question. 
>                               
>                               BTW, this issue was thoroughly 
> discussed on comp.text.pdf about 4-5 years ago.
>                               
>                               Cheers,
>                               Bill Segraves
> 
>                                       ----- Original Message ----- 
>                                       From: Leonard Rosenthol 
> <mailto:[EMAIL PROTECTED]> 
>                                       To: Post all your 
> questions about iText here 
> <mailto:itext-questions@lists.sourceforge.net> 
>                                       Sent: Sunday, October 
> 07, 2007 6:57 PM
>                                       Subject: Re: 
> [iText-questions] u guys konw how to read the data from 
> pdfusing java itext ?
> 
>                                        
>                                       Why would using this 
> approach fail on a PDF that is locked with a 
>                                       password? In order to 
> print the PDF, you have to have the ability to 
>                                       READ it ;). And if you 
> can read it in order to print it, you can 
>                                       read it to just read it...Right?
>                                       
>                                       Leonard
>                                       
>                                       
>                                       On Oct 7, 2007, at 
> 11:05 AM, [EMAIL PROTECTED] wrote:
>                                       
>                                       > If the PDF is locked 
> with a password, but still printable, the 
>                                       > approach offered by 
> this author is one that would work, while 
>                                       > attempting to use 
> this approach on the original PDF would fail. 
>                                       > This author was 
> simply trying to help the poster with an approach 
>                                       > that would avoid the 
> frustration that would ensue if he tried to 
>                                       > work with an original 
> locked PDF.
>                                       >
>                                       >
>                                       > Of course, the 
> approach espoused by the esteemed sage would be 
>                                       > easier, for both 
> locked and unlocked PDFs. OTOH, this author 
>                                       > doesn't count easier 
> to fail as an acceptable approach.
>                                       >
>                                       >
>                                       > Cheers,
>                                       >
>                                       > Bill Segraves
>                                       >
>                                       > -------------- 
> Original message from Leonard Rosenthol 
>                                       > 
> <[EMAIL PROTECTED]>: --------------
>                                       >
>                                       >
>                                       > > Why would working 
> through the PostScript be easier than doing 
>                                       > this on
>                                       > > the original PDF?
>                                       > >
>                                       > > You can get to all 
> the PDF operators just fine.
>                                       > > Font & text 
> information is more easily referenceable from the PDF
>                                       > > PostScript also has 
> "XObjects", Patterns, etc. that may contain 
>                                       > text.
>                                       > > etc.
>                                       > >
>                                       > > Not understanding 
> the logic :(.
>                                       > >
>                                       > > Leonard
>                                       > >
>                                       > >
>                                       > > On Oct 6, 2007, at 
> 4:53 PM, [EMAIL PROTECTED] 
> <mailto:[EMAIL PROTECTED]>  wrote:
>                                       > >
>                                       > > > Yes; but it is 
> not practicable with iText. You could, however, as
>                                       > > > long as the PDF 
> is printable, use the following procedure:
>                                       > > > 
>                                       > > > 1. Print to a PS file.
>                                       > > >
>                                       > > > 2. Scan the PS 
> file from step 1 above, dropping all lines that
>                                       > > > do not end with Tj or TJ.
>                                       > > >
>                                       > > > 3. Use a regular 
> expression (together with Substitution or 
>                                       > > > Match) to extract 
> the instances of "text fragment" from within
>                                       > > > multiple 
> instances of "(text fragment)Tj", printing the resulting
>                                       > > > text fragments to STDOUT. 
>                                       > > >
>                                       > > > Bruno has given 
> an excellent example of why you should not expect
>                                       > > > the resulting 
> output to make sense, i.e., the text fragments may
>                                       > > > not appear in the 
> order in which you'd like for them to appear. 
>                                       > > >
>                                       > > > Cheers,
>                                       > > >
>                                       > > > Bill Segraves
>                                       > > >
>                                       > > > -------------- 
> Original message from krammark
>                                       > > > : --------------
>                                       > > > 
>                                       > > >
>                                       > > > >
>                                       > > > > so , how we 
> read the data from pdf ?
>                                       > > > > i mean , can we 
> read them line by line from the specific pages ?
>                                       > > > > 
>                                       > > > > thanks buddy.
>                                       > > > >
>                                       > > > >
>                                       > > > > Bruno Lowagie 
> (iText) wrote:
>                                       > > > > >
>                                       > > > > > krammark wrote:
>                                       > > > > >> hey gusy, 
>                                       > > > > >> do u guys 
> have a idea how to read the data from pdf pages
>                                       > > > using itext ?
>                                       > > > > >> basically, i 
> want to read the data from table and write them
>                                       > > > into excel 
>                                       > > > > >> files.
>                                       > > > > >> is that possible ?
>                                       > > > > >
>                                       > > > > > There is no 
> such thing as 'a table' in plain PDF.
>                                       > > > > > It's just 
> lines and words painted on a canvas, 
>                                       > > > > > possibly in 
> an arbitrary order.
>                                       > > > > >
>                                       > > > > > Unless your 
> table cells are form fields, or your
>                                       > > > > > PDF 
> contains specific table structures (Tagged PDF), 
>                                       > > > > > iText 
> probably won't help you.
>                                       > > > > >
>                                       > > > > > br,
>                                       > > > > > Bruno



no_text.pdf
Description: no_text.pdf

