[iText-questions] text-as-string extraction from pdf to applet

Paolo Russian Tue, 06 Jul 2010 01:50:56 -0700

Hello, I have a closed-source Java applet that digitally signs files with 
USB keys / cards etc., and I'm adding it to the iText-generated PDF 
section of our JBoss Seam application.
The applet works either by feeding it a real URL of a real PDF file or by 
providing a string representation of the PDF file. The documentation gives 
no further details on this. Since the applet is outside the conversation 
and session scopes, and pages.xml is configured to require login on 
every page, passing the applet a server URL results in a correctly 
signed "login page", which is not exactly what I need.
So I'm trying to extract the content of the PDF and build a 
string from it, but I've run into problems. I already extract the data 
from the original iText XHTML file and save it to the server filesystem 
successfully (so that it becomes a real PDF file and not a, let's say, 
"virtual" PDF). Then I read the bytes back from this real PDF and build a 
string, but I always lose some extended characters, probably due to a 
wrong encoding.
I tried almost every possible Java encoding, but the best result I get, 
when comparing the original PDF with the PDF extracted from the signed 
.p7m and re-saved (BeyondCompare in binary mode), is a file 95% equal to 
the original, using UTF-8, but the page displays all white.
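
A minimal sketch of one way to test a candidate encoding before involving the applet: decode the bytes to a String, encode them back, and compare. The class and method names (RoundTripCheck, survives) are hypothetical, and the sample bytes assume that the "%âãÏÓ" header line corresponds to the byte values 0x25 0xE2 0xE3 0xCF 0xD3. It only checks the Java side; whatever charset the applet uses to turn the string back into bytes is unknown here.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class RoundTripCheck {

    // True if decoding the bytes with the given charset and encoding the
    // resulting String back with the same charset yields identical bytes.
    static boolean survives(byte[] original, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        byte[] back = new String(original, cs).getBytes(cs);
        return Arrays.equals(original, back);
    }

    public static void main(String[] args) {
        // Assumed byte values of the "%âãÏÓ" line at the top of the PDF.
        byte[] header = { 0x25, (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3 };
        for (String name : new String[] { "UTF-8", "US-ASCII", "ISO-8859-1" }) {
            System.out.println(name + " round trip intact: " + survives(header, name));
        }
    }
}
```

For these five bytes the check fails with UTF-8 and US-ASCII and passes with ISO-8859-1, because ISO-8859-1 maps every byte value to exactly one character. A passing check here does not guarantee the final file is intact if the receiving side encodes the string with a different charset.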


To make it short: when extracting the bytes from the PDF I can't get all 
of the characters, and I get a lot of ?????? instead of the original ones. 
The error is not in the saving part of the code, since when I pass a real 
URL (like c:\whatever.pdf) it works like a charm. So it must be something 
very newbish and wrong in the way I access the PDF data. We run Seam 
2.0.1, so we have none of the integrated iText pdfReader(..) stuff I've 
been reading posts about these days, and there is no chance we can update.



The bit of code that does the text extraction from the PDF file is 
(try/catches omitted etc.; I never got exceptions, by the way):

public String getStingOutOfPdf() {
    // render the view, then look the resulting document up in the store
    Renderer render = Renderer.instance();
    render.render(path);
    DocumentStore doc = DocumentStore.instance();

    if (doc != null) {
        DocumentData data = doc.getDocumentData("1");
        byte[] bytes = data.getData();
        return getText(bytes);
    }
    return null;
}

public String getText(byte[] arr) {
    String s = null;
    try {
        // tried everything out there; UTF8 and ISO8859_1 get extremely
        // close to the original, but not close enough
        s = new String(arr, "UTF8");
    } catch (Exception e) {
        e.printStackTrace();
    }
    return s;
}





The original looks like:
%PDF-1.4
%âãÏÓ
4 0 obj <</Length 126/Filter/FlateDecode>>stream
xœM±
[...omitted...]


and the final PDF, obtained by passing this string to the applet, signing 
it, and extracting the PDF back from the signed file, is:
%PDF-1.4
%????
4 0 obj <</Length 127/Filter/FlateDecode>>stream
x?+?r
[...omitted...]

As you can see, some characters are missing, and this obviously messes up 
the PDF.
Where am I going wrong?
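
A small sketch of how the two dumps above could be compared programmatically instead of by eye. The class and method names (ByteDiff, firstMismatch, replacedByQuestionMark) are hypothetical; the idea is to report where the first difference occurs and how many non-ASCII bytes came back as a literal '?', which is the typical footprint of a charset conversion that could not map a character.

```java
class ByteDiff {

    // Index of the first differing byte, or -1 if both arrays are identical.
    // If one array is a prefix of the other, the shorter length is returned.
    static int firstMismatch(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return a.length == b.length ? -1 : n;
    }

    // Counts positions where the original holds a byte >= 0x80 and the
    // result holds '?' (0x3F) at the same offset.
    static int replacedByQuestionMark(byte[] original, byte[] result) {
        int n = Math.min(original.length, result.length);
        int count = 0;
        for (int i = 0; i < n; i++) {
            if ((original[i] & 0x80) != 0 && result[i] == '?') {
                count++;
            }
        }
        return count;
    }
}
```

The position-by-position count is only meaningful up to the first point where the two files drift in length, so it is best read together with the first mismatch offset.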

Thanks


Paolo
