Re: [iText-questions] Problem with pdf regeneration

Mark Storer Fri, 18 Aug 2006 11:08:34 -0700

There's quite a bit of variance possible within a single visual appearance.

For example, fonts can be embedded, subsetted, non-embedded, or replaced by paths. Image DPIs can change.

But even when two files hold exactly the same objects representing the same information, their sizes can differ. For example, the following two objects are "identical":

-------

/Type /XObject

/Subtype /Form

/FormType 1

/BBox [ 0.00 0.00 50.00 100.00 ]

/Resources << /ProcSet [ /PDF ] >>

/Length 1000

/Matrix [ 1.00 0.00 0.00 1.00 0.00 0.00 ]

/Filter /FlateDecode

stream

/* 1000 bytes of data */

endstream

-------

<</Type/XObject/Subtype/Form/FormType 1/BBox[0 0 50 100]/Resources<</ProcSet[/PDF]>>/Length 1000/Filter/FlateDecode/Matrix[1 0 0 1 0 0]>>

stream

/* 1000 bytes of data */

endstream

-------

The dictionary portion of the first object is 182 bytes long, counting the enclosing << and >> on their own lines (+10 or +20 depending on how you write out your newlines). The second is 137 (+1 or +2). Identical data, with a difference of 54 to 63 bytes depending on the newline convention.

The /Length of a stream is sometimes written out as a separate object. Each object requires a minimum of 20 bytes (for the object's byte offset in the cross-reference table, though this can require an additional 5 or 6 bytes in some cases) + 16 bytes (each object has to be wrapped with specific tags and information, and this number can go up based on the 'object number' of the object) + the size of the object itself. So moving the /Length from a separate object into the above dictionaries would shrink the file by 36 bytes (the object's data length doesn't change; "1000" is 4 bytes long no matter how you store it).
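If you want to check those counts yourself, here is a minimal sketch in plain Java (no iText involved). It assumes the first layout has its enclosing << and >> on their own lines, which is what the 182-byte figure implies; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class DictionarySize {
    // First layout: one entry per line, << and >> on their own lines (an assumption).
    static final String[] VERBOSE_LINES = {
        "<<",
        "/Type /XObject",
        "/Subtype /Form",
        "/FormType 1",
        "/BBox [ 0.00 0.00 50.00 100.00 ]",
        "/Resources << /ProcSet [ /PDF ] >>",
        "/Length 1000",
        "/Matrix [ 1.00 0.00 0.00 1.00 0.00 0.00 ]",
        "/Filter /FlateDecode",
        ">>"
    };

    // Second layout: everything on one line with no optional whitespace.
    static final String COMPACT_LINE =
        "<</Type/XObject/Subtype/Form/FormType 1/BBox[0 0 50 100]"
        + "/Resources<</ProcSet[/PDF]>>/Length 1000/Filter/FlateDecode"
        + "/Matrix[1 0 0 1 0 0]>>";

    static int byteLength(String s) {
        return s.getBytes(StandardCharsets.US_ASCII).length;
    }

    // Size of the verbose dictionary when every line ends with the given newline.
    static int verboseSize(String newline) {
        int total = 0;
        for (String line : VERBOSE_LINES) {
            total += byteLength(line) + byteLength(newline);
        }
        return total;
    }

    // Size of the compact dictionary followed by a single newline.
    static int compactSize(String newline) {
        return byteLength(COMPACT_LINE) + byteLength(newline);
    }

    public static void main(String[] args) {
        for (String newline : new String[] {"\n", "\r\n"}) {
            int verbose = verboseSize(newline);
            int compact = compactSize(newline);
            System.out.println(newline.length() + "-byte newlines: " + verbose
                + " vs " + compact + " bytes, difference " + (verbose - compact));
        }
    }
}
```

Run it once and you get one line per newline convention, so you can see how much of the difference is pure whitespace.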

In the above example, I've used two decimal places where they were unnecessary. 1.00 -> 1. I've seen far more trailing zeros. iText strips out all trailing zeroes. Hurray iText!
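To make the trailing-zero point concrete, here is a rough sketch of how a writer could emit numbers compactly. This is not iText's actual code, just one way to do it with java.math.BigDecimal; the class and method names are hypothetical.

```java
import java.math.BigDecimal;

public class CompactNumber {
    // Render a decimal number without trailing zeros and without an exponent,
    // e.g. "1.00" becomes "1" and "0.50" becomes "0.5".
    static String compact(String number) {
        BigDecimal value = new BigDecimal(number);
        if (value.signum() == 0) {
            return "0"; // handle zero explicitly so "0.00" never leaks through
        }
        // toPlainString() avoids scientific notation such as "1E+2" for 100.
        return value.stripTrailingZeros().toPlainString();
    }

    // Render a whole array such as a /BBox or /Matrix, separated by single spaces.
    static String compactArray(String... numbers) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < numbers.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(compact(numbers[i]));
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        System.out.println("/BBox" + compactArray("0.00", "0.00", "50.00", "100.00"));
        System.out.println("/Matrix" + compactArray("1.00", "0.00", "0.00", "1.00", "0.00", "0.00"));
    }
}
```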

Furthermore, both streams are compressed using "FlateDecode" (the same Deflate algorithm that ZIP uses). If the first object's stream were uncompressed text, the sizes of the two streams might be radically different even though the data they contained was identical. Font and image data are generally compressed to begin with, but things like content streams and scripts are less likely to be compressed initially (for easier debugging).
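A quick way to see the effect of Flate compression on a text-heavy stream is the sketch below. It uses only java.util.zip from the JDK; the sample "content stream" is invented for the demonstration and the class name is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FlateDemo {
    // Compress bytes in zlib format, which is what a /FlateDecode stream holds.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverse of deflate(); wraps the checked exception for convenience.
    static byte[] inflate(byte[] packed) {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && inflater.needsInput()) {
                    break; // input ended early; stop instead of spinning
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not valid Flate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // An invented, repetitive page content stream: fifty short lines of text.
    static byte[] sampleContentStream() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 50; i++) {
            sb.append("BT /F1 12 Tf 72 ").append(700 - 12 * i)
              .append(" Td (Line ").append(i).append(") Tj ET\n");
        }
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] raw = sampleContentStream();
        byte[] packed = deflate(raw);
        System.out.println("uncompressed: " + raw.length + " bytes");
        System.out.println("FlateDecode:  " + packed.length + " bytes");
    }
}
```

Same data either way; only the number of bytes on disk changes.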

If a PDF has been modified multiple times and saved (File->Save in Acrobat, NOT Save As), each set of changes is APPENDED to the original file. If you take a large PDF, rip out all but one page, and hit Ctrl+S, the resulting file WILL be larger. A file with a trivial amount of information in it that has been through a large number of changes can be disproportionately large (particularly if embedded fonts have been changed... all embedded fonts will still be present, and I don't know that Acrobat is smart enough to find an old copy of a font if you change back to the original). Running a file through iText will result in a "Save As" unless you specifically instruct it not to. If you digitally sign a PDF, iText has to force the 'append' behavior... it needs something stable to sign.
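One rough way to tell whether a file carries appended changes is to count its %%EOF markers, since every appended update ends with its own marker. The sketch below does that in plain Java. Treat it as a heuristic only: the marker can show up for other reasons too (linearized files, or inside embedded data). The class name is hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RevisionCounter {
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    // True if the marker bytes start at the given offset.
    private static boolean matchesAt(byte[] data, int offset) {
        for (int j = 0; j < EOF_MARKER.length; j++) {
            if (data[offset + j] != EOF_MARKER[j]) {
                return false;
            }
        }
        return true;
    }

    // Count %%EOF markers; more than one suggests appended (incremental) saves.
    static int countEofMarkers(byte[] pdf) {
        int count = 0;
        for (int i = 0; i + EOF_MARKER.length <= pdf.length; i++) {
            if (matchesAt(pdf, i)) {
                count++;
                i += EOF_MARKER.length - 1; // skip past this marker
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java RevisionCounter file.pdf");
            return;
        }
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(args[0] + ": " + countEofMarkers(pdf) + " %%EOF marker(s)");
    }
}
```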

iText does not "Linearize" PDFs. If the initial PDF was "web ready" (set up for incremental download, which gets the first page ready for display as fast as possible by pushing all the objects it needs to the front and adding some info), this additional data will be stripped out.
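If you want to know whether the original was "web ready" before regenerating it, a simple sketch is to look for the /Linearized key near the start of the file. The PDF specification requires the linearization dictionary to sit within the first 1024 bytes. This is a plain-Java approximation (a text search, not a real parser), and the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class LinearizedCheck {
    private static final int HEAD_LIMIT = 1024;

    // Text search for the /Linearized key in the first 1024 bytes of the file.
    static boolean looksLinearized(byte[] pdf) {
        int length = Math.min(pdf.length, HEAD_LIMIT);
        // ISO-8859-1 maps every byte to one char, so binary data cannot break decoding.
        String head = new String(pdf, 0, length, StandardCharsets.ISO_8859_1);
        return head.contains("/Linearized");
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n1 0 obj\n<</Linearized 1/L 1234>>\nendobj\n"
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("looks linearized: " + looksLinearized(sample));
    }
}
```

If the check is true for one.pdf and false for the regenerated file, some of the size difference comes from the dropped linearization data.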

This point won't apply to your particular PDF but is worth mentioning anyway: in Acrobat 6.0 (PDF 1.5), Adobe introduced a feature that allows cross-reference tables (that +20 bytes or more per object I mentioned earlier) and the objects themselves to be placed inside compressed streams. If you're willing to abandon Acrobat/Reader 5.0, you can use this feature to shrink your PDFs still further (though not quite as much as you might think... quite a bit of most PDFs' size is tied up in data that is already compressed: images and fonts). Object data is particularly compressible. Lots of '/', '<<' and so forth. Every object will have a "/Type", many have "/BBox"... /Parent and /Kids are also common, particularly in AcroForm PDFs.

Hey guys! I just called a PDF form an AcroForm without thinking about it for the first time. It feels kinda like when I moved to Texas and used "ya'll" for the first time... only without the following self-flagellation. ;)

--Mark Storer
Senior Software Engineer
Cardiff Software

#include <disclaimer>
typedef std::Disclaimer<Cardiff> DisCard;

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
Sent: Friday, August 18, 2006 8:40 AM
To: Post all your questions about iText here
Subject: [iText-questions] Problem with pdf regeneration

Hi,

This is a continuation of my previous query.
The PDF version of one.pdf is 1.4 (Acrobat 5.x).
Another point to note: one.pdf is 72 kB, while two.pdf, the PDF regenerated by iText, is 71 kB.
Why does iText change the size of the file when it is just regenerating it?

Regards,
Triloke

-----Original Message-----
From: Triloke RAJBHANDARY/ITD GLT/HSDI/[EMAIL PROTECTED]
Sent by: [EMAIL PROTECTED]
Sent: 08/18/2006 08:23 PM
Please respond to: Post all your questions about iText here <[email protected]>
To: [email protected]
Subject: [iText-questions] Problem with pdf regeneration
Hi,
I am using the following piece of code to regenerate a PDF.
one.pdf is the PDF that is read.
two.pdf is the PDF that is regenerated using iText.
one.pdf, when viewed in Acrobat Reader 7.0, is landscape with a size of 11.69*8.26.
two.pdf, when viewed in the same Acrobat Reader 7.0, is portrait with a size of 8.26*11.69.
These sizes are taken from the document properties in Acrobat Reader 7.0.
The iText version that I am using is 1.4.3.
Why is this happening? As I am just regenerating the PDF, shouldn't it produce a replica of the original? I haven't used any rotation in the code. The relevant portion of the code is below:

try {
    // Read the source document and take the size of its first page.
    PdfReader reader = new PdfReader("C:\\test\\one.pdf");
    int n = reader.getNumberOfPages();
    Rectangle psize = reader.getPageSize(1);
    float width = psize.width();
    System.out.println("one width >>" + width);
    float height = psize.height();
    System.out.println(" one height >>" + height);

    // Create the output document with that page size and no margins.
    Document document = new Document(psize, 0, 0, 0, 0);
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("c:\\test\\two.pdf"));
    document.open();
    PdfContentByte cb = writer.getDirectContent();
    System.out.println("There are " + n + " pages in the document.");

    // Import every page of the source and draw it unscaled at the origin.
    for (int i = 1; i <= n; i++) {
        document.newPage();
        PdfImportedPage page1 = writer.getImportedPage(reader, i);
        cb.addTemplate(page1, 1.0f, 0, 0, 1.0f, 0, 0);
    }
    document.close();

    // Re-open the result and report the size of its first page.
    PdfReader reader2 = new PdfReader("c:\\test\\two.pdf");
    Rectangle p2size = reader2.getPageSize(1);
    System.out.println("two width >>" + p2size.width());
    System.out.println("two height >>" + p2size.height());
} catch (Exception e) {
    // The posted fragment ended before the catch block; this one is a placeholder.
    e.printStackTrace();
}

Regards,
Triloke

************************************************************
HSBC Software Development (India) Pvt Ltd
HSBC Center Riverside,West Avenue ,
25 B Kalyani Nagar Pune 411 006 INDIA

Telephone: +91 20 26683000
Fax: +91 20 26681030
************************************************************

*******************************************************************
This e-mail is confidential. It may also be legally privileged.
If you are not the addressee you may not copy, forward, disclose
or use any part of it. If you have received this message in error,
please delete it and all copies from your system and notify the
sender immediately by return e-mail.

Internet communications cannot be guaranteed to be timely,
secure, error or virus-free. The sender does not accept liability
for any errors or omissions.
*******************************************************************
"SAVE PAPER - THINK BEFORE YOU PRINT!"

******************************************************************
This message originated from the Internet. Its originator may or
may not be who they claim to be and the information contained in
the message and any attachments may or may not be accurate.
******************************************************************

_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions
