I think this will ultimately come down to using PdfEncodings to process the
ToUnicode CMap of each font referenced by the page's content stream.  Once
that's done and tested, we'll have usable text strings.  Then it's a matter of
correctly handling each of the text operators, the current coordinate system,
etc., to determine which strings actually belong next to which other strings.
Like Paulo said in another email, getting the text out of the content stream
requires a bit of rocket science: there are many pitfalls, and pretty much any
algorithm you come up with is guaranteed to *not* work in some situations
(even the internal algorithms Acrobat uses for search and text selection
break down with some PDF files).
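
To make the ToUnicode step concrete, here is a minimal sketch in plain Java.
It does not use iText or PdfEncodings at all. It assumes the font uses 2-byte
glyph codes, that the ToUnicode CMap contains only bfchar entries (bfrange is
ignored), and that the string operand is written in hex. The class and method
names are made up for illustration. It shows why Encoding.ASCII.GetString on
the raw page bytes finds nothing: the bytes are glyph codes, not characters.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText.
final class ToUnicodeSketch {
    // Body of a "beginbfchar ... endbfchar" section of a ToUnicode CMap.
    private static final Pattern SECTION =
            Pattern.compile("beginbfchar(.*?)endbfchar", Pattern.DOTALL);
    // One entry such as "<0003> <0048>": glyph code, then UTF-16BE text.
    private static final Pattern ENTRY =
            Pattern.compile("<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>");

    /** Builds a glyph code -> text map from the bfchar entries only. */
    static Map<Integer, String> parseBfChars(String cmap) {
        Map<Integer, String> map = new HashMap<>();
        Matcher section = SECTION.matcher(cmap);
        while (section.find()) {
            Matcher entry = ENTRY.matcher(section.group(1));
            while (entry.find()) {
                int code = Integer.parseInt(entry.group(1), 16);
                String hex = entry.group(2);
                StringBuilder text = new StringBuilder();
                // Each group of 4 hex digits is one UTF-16 code unit.
                for (int i = 0; i + 4 <= hex.length(); i += 4) {
                    text.append((char) Integer.parseInt(hex.substring(i, i + 4), 16));
                }
                map.put(code, text.toString());
            }
        }
        return map;
    }

    /** Decodes a hex string operand (the part inside <...> before Tj). */
    static String decode(String hexOperand, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        // Assumption: every glyph code is exactly 2 bytes (4 hex digits).
        for (int i = 0; i + 4 <= hexOperand.length(); i += 4) {
            int code = Integer.parseInt(hexOperand.substring(i, i + 4), 16);
            // Unmapped codes become U+FFFD (the replacement character).
            out.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String cmap = "2 beginbfchar\n<0003> <0048>\n<0004> <0069>\nendbfchar";
        System.out.println(decode("00030004", parseBfChars(cmap))); // prints "Hi"
    }
}
```

A real CMap also has codespacerange and bfrange sections, and code lengths
can vary per font, so treat this only as an outline of the lookup itself.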

I'm actually working on this right now.  With all of the tools available in 
iText, the only remaining part (I'm pretty sure) is the spatial analysis.
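
As a rough idea of what the spatial analysis involves, here is a simplified
sketch in plain Java. It is not the code I'm working on and not an iText API.
It assumes each decoded string already comes with a start position and width
in page coordinates (the hypothetical Chunk class), and it ignores rotation,
font size changes and anything else the text matrix can do. Chunks are grouped
into lines by y, ordered by x within a line, and separated by a space only
when the horizontal gap exceeds a threshold.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration only.
final class SpatialSketch {
    /** A decoded string with its start position and width on the page. */
    static final class Chunk {
        final String text;
        final float x;
        final float y;
        final float width;

        Chunk(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Rebuilds reading order. Chunks whose y differs from the first chunk of
     * the current line by at most lineTolerance share that line. A space is
     * inserted when the gap to the previous chunk exceeds spaceGap.
     */
    static String assemble(List<Chunk> chunks, float lineTolerance, float spaceGap) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // PDF y grows upward, so larger y means higher on the page.
        sorted.sort((a, b) -> Float.compare(b.y, a.y));

        List<List<Chunk>> lines = new ArrayList<>();
        for (Chunk c : sorted) {
            List<Chunk> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - c.y) <= lineTolerance) {
                last.add(c);
            } else {
                List<Chunk> line = new ArrayList<>();
                line.add(c);
                lines.add(line);
            }
        }

        StringBuilder out = new StringBuilder();
        for (List<Chunk> line : lines) {
            line.sort((a, b) -> Float.compare(a.x, b.x)); // left to right
            if (out.length() > 0) {
                out.append('\n');
            }
            Chunk prev = null;
            for (Chunk c : line) {
                if (prev != null && c.x - (prev.x + prev.width) > spaceGap) {
                    out.append(' ');
                }
                out.append(c.text);
                prev = c;
            }
        }
        return out.toString();
    }
}
```

The thresholds are the weak point: a value that works for one producer's
output will split or merge words in another's, which is one of the pitfalls
mentioned above.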

If I get some code together I'll post it for comment.

- K


----------------------- Original Message -----------------------
 
From: Vinoo <[EMAIL PROTECTED]>
To: itext-questions@lists.sourceforge.net
Cc: 
Date: Wed, 29 Oct 2008 11:21:06 -0700 (PDT)
Subject: Re: [iText-questions] Help with parsing the PDF generated by Crystal 
reports-V9
 

Thanks Kevin for the reply and providing some insight into this.

Can we use PdfEncodings to convert the format and then use the result for the
extraction?

Thanks,
Uma


Vinoo wrote:
> 
> Hi,
> I am trying to parse the contents of a PDF with iTextSharp using:
> PdfReader reader = new PdfReader("Test.pdf");
> byte[] pageContentByteArray = reader.GetPageContent(pageNumber);
> I am using this byte array to search for a particular text based on a
> delimiter pattern, converting it to a string with:
> string test = Encoding.ASCII.GetString(pageContentByteArray);
> I am able to match the required text pattern inside the string generated
> using the above statement. The above logic works absolutely fine if we use
> a normal PDF input file.
> My requirement is to read a PDF file which is created by CRYSTAL REPORTS
> (Version-9).
> I have the byte array for the page, and I tried converting it to a string
> using ASCII, Unicode, UTF8, UnicodeBig:
>     string test = Encoding.ASCII.GetString(invoicePageContentByteArray);
>     string test = Encoding.Unicode.GetString(invoicePageContentByteArray);
>     string test = Encoding.UTF8.GetString(invoicePageContentByteArray);
>     ... and also using UnicodeBig
>  
> The output is not readable. I could not find any of the page's text in the
> output string. I guess the PDF generated by Crystal Reports uses some
> other encoding.
> (Note: we verified the template Crystal Reports uses to generate the
> PDF. The search delimiter pattern is defined as a Text object.)
> There should be some way of doing this, but I am not sure what I am
> missing. Can anyone please suggest ideas to resolve the problem?
> 
> -- 
> Regards,
> Uma
> 

-- 
View this message in context: 
http://www.nabble.com/Help-with-parsing-the-PDF-generated-by-Crystal-reports-V9-tp20229737p20232896.html
Sent from the iText - General mailing list archive at Nabble.com.


-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
iText-questions mailing list
iText-questions@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/itext-questions

Buy the iText book: http://www.1t3xt.com/docs/book.php
