[iText-questions] Fwd: text-as-string extraction from pdf to applet

Paolo Russian Tue, 06 Jul 2010 02:29:26 -0700

Hello, I have a closed-source Java applet that digitally signs files with
USB keys / cards etc., and I'm adding it to the section of our JBoss Seam
application that serves our iText-generated PDFs.
The applet takes its input either as a real URL of a real PDF file or as
a string representation of the PDF file. The documentation gives no
further details on this. Since the applet runs outside the conversation
and session scopes, and pages.xml is configured to require login on
every page, passing the applet a server URL yields a correctly
signed "login page", which is not exactly what I need.
So I'm trying to extract the content of the PDF and build a string from
it, but I've run into problems. I already render the original iText
XHTML file and save the result to the server filesystem successfully (so
that it becomes a real PDF file and not a, let's say, "virtual" PDF).
Then I read the bytes back out of this real PDF and build a string from
them, but I always lose some extended characters, probably due to a
wrong encoding.
I tried almost every possible Java encoding. Comparing the original PDF
with the PDF extracted from the signed .p7m and re-saved, using
BeyondCompare in binary mode, the best result I get is with UTF-8: a
file 95% equal to the original, but the page displays all white.
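
The binary comparison described above can also be scripted, which makes it easier to see exactly which bytes change. The following is only a minimal sketch: the class and method names (ByteDiffSketch, firstMismatch, hex) are hypothetical and not part of Seam, iText, or the applet.

```java
final class ByteDiffSketch {

    /** Returns the index of the first differing byte, or -1 if both arrays are identical. */
    static int firstMismatch(byte[] a, byte[] b) {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        // Same prefix: either identical, or one array is simply longer.
        return a.length == b.length ? -1 : common;
    }

    /** Formats up to count bytes starting at from as two-digit hex values. */
    static String hex(byte[] data, int from, int count) {
        StringBuilder sb = new StringBuilder();
        int end = Math.min(data.length, from + count);
        for (int i = Math.max(0, from); i < end; i++) {
            sb.append(String.format("%02X ", data[i] & 0xFF));
        }
        return sb.toString().trim();
    }
}
```

Both inputs could be loaded with java.nio.file.Files.readAllBytes (Java 7 or later). Printing hex(...) around the reported offset for each file shows whether high bytes were replaced by 3F, the byte for "?".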


To make it short: when I extract the bytes from the PDF I can't get all
of the characters, and I get a lot of ?????? instead of the original
ones. The error is not in the saving part of the code, since it works
like a charm when I pass a real URL (like c:\whatever.pdf). So it must
be some beginner's mistake in the way I access the PDF data. We run Seam
2.0.1, so we have none of the integrated iText pdfReader(..) stuff I've
been reading posts about these days, and there is no chance we can
upgrade.
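
The loss can be reproduced in isolation, without Seam or the applet. The sketch below takes the first bytes of a typical PDF header (the "%âãÏÓ" comment line), decodes them to a String and encodes them back, once with UTF-8 and once with ISO-8859-1. The class name and helper methods are hypothetical and exist only for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class CharsetRoundTripSketch {

    /** Decodes the bytes with the charset, then encodes the resulting String with the same charset. */
    static byte[] roundTrip(byte[] data, Charset cs) {
        return new String(data, cs).getBytes(cs);
    }

    /** True if the bytes come back unchanged after a decode/encode round trip. */
    static boolean survives(byte[] data, Charset cs) {
        return Arrays.equals(data, roundTrip(data, cs));
    }

    public static void main(String[] args) {
        // '%' followed by four bytes above 0x7F and a newline, as in the PDF binary marker line.
        byte[] header = {0x25, (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3, 0x0A};
        System.out.println("UTF-8 survives:      " + survives(header, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1 survives: " + survives(header, StandardCharsets.ISO_8859_1));
    }
}
```

Arbitrary binary data is generally not valid UTF-8, so the decoder substitutes replacement characters and the original bytes cannot be recovered. ISO-8859-1 maps every byte value to exactly one character, but it only round-trips if the receiving side converts the String back to bytes with that same charset.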



The piece of code that does the text extraction from the PDF file is
(try/catches omitted etc.; I never got any exceptions, by the way):

public String getStingOutOfPdf() {
    // Render the page, then look up the generated document in the store.
    Renderer render = Renderer.instance();
    render.render(path);
    DocumentStore doc = DocumentStore.instance();

    if (doc != null) {
        DocumentData data = doc.getDocumentData("1");
        byte[] bytes = data.getData();
        return getText(bytes);
    }
    return null; // no document store available
}

public String getText(byte[] arr) {
    String s = null;
    try {
        // Tried every charset out there; UTF8 and ISO8859_1 get extremely
        // close to the original, but not close enough.
        s = new String(arr, "UTF8");
    } catch (Exception e) {
        e.printStackTrace();
    }
    return s;
}
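
For comparison, a charset-independent way to carry PDF bytes inside a String is Base64. This is only a sketch under a big assumption: the applet's documentation does not say which string representation it expects, so whether it accepts Base64 is unknown. The class name is hypothetical, and java.util.Base64 requires Java 8 or later.

```java
import java.util.Base64;

final class Base64TransportSketch {

    /** Encodes raw PDF bytes as an ASCII-only string, so no charset conversion can alter them. */
    static String encode(byte[] pdfBytes) {
        return Base64.getEncoder().encodeToString(pdfBytes);
    }

    /** Restores the exact original bytes from the Base64 text. */
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }
}
```

Because the encoded text contains only ASCII letters, digits, "+", "/" and "=", it passes unchanged through any charset conversion on the way to the receiver, at the cost of roughly one third more data.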





the original looks like:
%PDF-1.4
%âãÏÓ
4 0 obj<</Length 126/Filter/FlateDecode>>stream
xœM±
[...omitted...]


and the final PDF, obtained by passing this string to the applet,
signing it, and extracting it from the signed file back to PDF, is:
%PDF-1.4
%????
4 0 obj<</Length 127/Filter/FlateDecode>>stream
x?+?r
[...omitted...]

As you can see, some characters are missing, and this obviously breaks
the PDF.
Where am I going wrong?

Thanks


Paolo


