Re: [iText-questions] Parsing PDF retrieving layers and images

Leonard Rosenthol Sun, 07 Aug 2011 09:01:29 -0700

>The stream in this example contains the following attributes:
>{/Intent=/RelativeColorimetric, /Decode=[0.0, 255.0], /Type=/XObject, 
>/Subtype=/Image, /ColorSpace=22 0 R, /Name=/X, /BitsPerComponent=8, 
>/Width=314, /Metadata=24 0 R, /Length=125, /Height=323, /Filter=/FlateDecode}
>
Grab the metadata and read it! It should have all sorts of useful stuff in 
there. (You can read it in Adobe Acrobat by clicking on the image and then 
right-clicking to get to Metadata.)
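
As a rough illustration of reading that metadata: the /Metadata entry normally points to an XMP packet, which is plain XML. The sketch below assumes the bytes of that stream have already been extracted (with iText 5 that would presumably be PdfReader.getStreamBytes on the stream behind 24 0 R) and that the title was stored as dc:title. Both are assumptions, as are the class name and the sample packet. It only uses the XML parser that ships with the JDK.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of iText.
class XmpTitleSketch {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Made-up sample packet; a real one comes from the image's /Metadata stream.
    static final String SAMPLE =
        "<?xpacket begin=\"\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>"
        + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
        + "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
        + "<rdf:Description rdf:about=\"\" xmlns:dc=\"" + DC_NS + "\">"
        + "<dc:title><rdf:Alt><rdf:li xml:lang=\"x-default\">left</rdf:li></rdf:Alt></dc:title>"
        + "</rdf:Description></rdf:RDF></x:xmpmeta>"
        + "<?xpacket end=\"w\"?>";

    /** Returns the first dc:title value found in the packet, or null if there is none. */
    static String readTitle(byte[] xmpBytes) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the *NS lookups below
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmpBytes));

            NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
            if (titles.getLength() == 0) {
                return null;
            }
            // dc:title is a language alternative: rdf:Alt holding rdf:li entries.
            NodeList items = ((Element) titles.item(0)).getElementsByTagNameNS(RDF_NS, "li");
            if (items.getLength() == 0) {
                return null;
            }
            return items.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XMP packet", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readTitle(SAMPLE.getBytes(StandardCharsets.UTF_8))); // prints "left"
    }
}
```

If the authoring tool rewrote the image on export, the packet may be missing or may not contain the title at all, so a null result is worth handling.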

Leonard

From: Lukas Johansson
Reply-To: Post here
Date: Sun, 7 Aug 2011 05:27:03 -0700
To: Post here
Subject: Re: [iText-questions] Parsing PDF retrieving layers and images

Thank you for your well-written answer. I've tried your suggestions but still 
cannot find the image<->layer information. See below.

> We'd have to see a PDF to make sure if the XML is still there.
I did attach the PDF I'm working with to the original mail; of course, it might 
have been stripped by the mailing list. Anyhow, here it is: 
http://osram.sajbertown.com/itext-evaluation.pdf

> We'd have to see a PDF to make sure if the XML is still there.
> If it is, you need to get the PRStream of the Image and get the bytes of that 
> stream.
if (stream.get(PdfName.SUBTYPE) != null && stream.get(PdfName.SUBTYPE).equals(PdfName.IMAGE)) {
    PdfObject oc = stream.get(PdfName.OC);
    //oc == null
    PdfImageObject image = new PdfImageObject(stream);
    oc =  image.get(PdfName.OC);
    //oc == null
}
The stream in this example contains the following attributes:
{/Intent=/RelativeColorimetric, /Decode=[0.0, 255.0], /Type=/XObject, 
/Subtype=/Image, /ColorSpace=22 0 R, /Name=/X, /BitsPerComponent=8, /Width=314, 
/Metadata=24 0 R, /Length=125, /Height=323, /Filter=/FlateDecode}

> The reference to the layer should be imgObject.get(PdfName.OC);
PdfImageObject image = renderInfo.getImage();
PdfObject oc = image.get(PdfName.OC);
//oc == null

> Depending on the tool that creates the PDF, the info about the layers can be:
> [1] part of the content stream of the page

If the layer information is part of the page's content stream, am I right that I 
should use one of the following methods and then somehow parse the content?
byte[] content = reader.getPageContent(1);
PdfContentByte content = writer.getDirectContent();

> Once you have a Matrix object, you can retrieve the X and Y translation:
> float x = matrix.get(Matrix.I31);
> float y = matrix.get(Matrix.I32);
Thank you, this tip will be helpful later on when I need the absolute 
position of each image, and it should work in my evaluation example (left, 
right). However, in the real-world application after this evaluation there might 
be up to 20 images, which could be placed more or less anywhere in the document, 
so this solution would mean a lot of guessing to find the correct image.

Once again thank you for your answer.
Cheers
Lukas Johansson

________________________________
From: 1T3XT BVBA
Sent: Sunday, August 07, 2011 09:04
To: Post all your questions about iText here
Subject: Re: [iText-questions] Parsing PDF retrieving layers and images

On 6/08/2011 19:11, Lukas Johansson wrote:
Hello,
I'm currently evaluating iTextPDF for a project and now I'm stuck with a 
problem.

Looking at the things you've tried, you've read the book well.
Now you need to know some advanced stuff.

My use case requires me to find the left image (PDF format) in the document and 
do some stuff with it and then do the same thing with the right image. The 
PDF document is created by InDesign and the images do not have any special ID.

I don't know if you can add extra IDs to stream dictionaries containing images.
Maybe, maybe not. So let's look at other options.

At first I thought I would traverse all images and compare their file names 
to see if I had found the right one, but as I understand it, file names are not 
preserved when images are added to a PDF document.

No, file names are not preserved.
Moreover, some image types are converted to another image type before they are 
added to a PDF.
For instance: a PNG will be converted to another type of image (what type? that 
depends on the tool used to create the PDF).

I then tried to use the image's metadata (which I can see as XML if I look at 
the file in a text editor) by adding a title attribute and checking against that, 
but I couldn't find how to get hold of this metadata from a PdfObject/PdfImage.

If the image type is converted to another type of image, chances are that the 
XML has disappeared.
We'd have to see a PDF to make sure if the XML is still there.
If it is, you need to get the PRStream of the Image and get the bytes of that 
stream.

I then placed each image in its own layer, called left and right, and tried 
either to traverse the layers to find each layer's image or to traverse all 
images and check which layer they belong to.

Depending on the tool that creates the PDF, the info about the layers can be:
[1] part of the content stream of the page
[2] an entry of the stream object of the image
If [1] is the case, then you'll have a lot of work to parse the content stream.
If [2] is the case, you can use stream.get(PdfName.OC); to find a reference to 
the Optional Content dictionary.
Once you have the Optional Content dictionary, you know what layer the image 
belongs to.
Note that [2] is preferred over [1] when creating PDFs with images that belong 
to a specific layer.
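
For case [1], a very rough sketch of what "parsing the content stream" could look like. In a content stream, optional content is marked with "/OC /SomeName BDC ... EMC", and an XObject is painted with "/Name Do". The code below only splits on whitespace, so it will break on string literals, inline images and inline dictionaries that contain spaces; a real implementation would need a proper PDF tokenizer. The class name and the sample operators (/MC0, /Im0, ...) are made up for illustration, and the input is assumed to be the page content (for example the bytes from reader.getPageContent(1)) decoded as ISO-8859-1.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of iText.
class LayerScanSketch {
    // Made-up content stream with two images, each inside its own /OC block.
    static final String SAMPLE =
        "q 150 0 0 160 40 400 cm /OC /MC0 BDC /Im0 Do EMC Q "
        + "q 150 0 0 160 320 400 cm /OC /MC1 BDC /Im1 Do EMC Q";

    /**
     * Maps every XObject name painted with Do to the property name of the
     * innermost enclosing "/OC ... BDC" block, or to null if there is none.
     */
    static Map<String, String> imagesToLayers(String content) {
        Map<String, String> result = new LinkedHashMap<>();
        // One entry per open marked-content block; "" means "not an /OC block".
        Deque<String> open = new ArrayDeque<>();
        String[] tokens = content.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            String token = tokens[i];
            if (token.equals("BDC")) {
                // Operands precede the operator: tag, then properties name.
                boolean isLayer = i >= 2 && tokens[i - 2].equals("/OC");
                open.push(isLayer ? tokens[i - 1] : "");
            } else if (token.equals("BMC")) {
                open.push("");
            } else if (token.equals("EMC")) {
                if (!open.isEmpty()) {
                    open.pop();
                }
            } else if (token.equals("Do") && i >= 1) {
                String layer = null;
                for (String name : open) { // iterates innermost block first
                    if (!name.isEmpty()) {
                        layer = name;
                        break;
                    }
                }
                result.put(tokens[i - 1], layer);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(imagesToLayers(SAMPLE)); // prints {/Im0=/MC0, /Im1=/MC1}
    }
}
```

Names such as /MC0 are only resource names. The layer itself still has to be looked up in the /Properties entry of the page's resource dictionary, and Do also paints form XObjects, so the names should be checked against the page's image XObjects.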

* Found the Image element in two ways
   1.  int n = reader.getXrefSize();
       PdfObject object;
       PRStream stream;
       for (int i = 0; i < n; i++) {
           object = reader.getPdfObject(i);
           // Skip free entries and anything that is not a stream
           if (object == null || !object.isStream()) continue;
           stream = (PRStream) object;
           if (PdfName.IMAGE.equals(stream.get(PdfName.SUBTYPE))) {
               PdfImageObject image = new PdfImageObject(stream);
           }
       }

That's the "dirty" way: you loop over ALL the objects in the PDF.
This way, you may find images that aren't even shown on any page in the 
PDF.

   2. Using PdfReaderContentParser as in the ExtractImages example.

I think this is the better way.
You're missing only one little piece of information (not mentioned in the book).
See 
http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/ImageRenderInfo.html
Currently, you retrieve the PdfImageObject with the getImage() method.
To find out its location, you should also use the getImageCTM() method.
CTM is short for Current Transformation Matrix.
Once you have a Matrix object, you can retrieve the X and Y translation:
float x = matrix.get(Matrix.I31);
float y = matrix.get(Matrix.I32);
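
To make the idea concrete without depending on iText, here is a small sketch that treats each CTM as a plain float[9] in row-major order [a b 0, c d 0, e f 1], so that positions 6 and 7 hold the translation. That layout, the class name and the sample numbers are assumptions for illustration. For an unrotated image, the translation is the lower-left corner of the image on the page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of iText.
class ImagePositionSketch {
    // Assumed row-major positions of the translation in a 3x3 matrix.
    static final int I31 = 6; // x translation (e)
    static final int I32 = 7; // y translation (f)

    static float translateX(float[] ctm) {
        return ctm[I31];
    }

    static float translateY(float[] ctm) {
        return ctm[I32];
    }

    /**
     * Returns the indices of the given matrices ordered left to right;
     * ties are broken top to bottom (PDF y grows upwards, so higher y first).
     */
    static List<Integer> leftToRight(List<float[]> ctms) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < ctms.size(); i++) {
            order.add(i);
        }
        order.sort((i, j) -> {
            int byX = Float.compare(translateX(ctms.get(i)), translateX(ctms.get(j)));
            if (byX != 0) {
                return byX;
            }
            return Float.compare(translateY(ctms.get(j)), translateY(ctms.get(i)));
        });
        return order;
    }

    public static void main(String[] args) {
        List<float[]> ctms = new ArrayList<>();
        // Equivalent to "150 0 0 160 320 400 cm": placed on the right.
        ctms.add(new float[] {150, 0, 0, 0, 160, 0, 320, 400, 1});
        // Equivalent to "150 0 0 160 40 400 cm": placed on the left.
        ctms.add(new float[] {150, 0, 0, 0, 160, 0, 40, 400, 1});
        System.out.println(leftToRight(ctms)); // prints [1, 0]
    }
}
```

With only two images, the first index is the left one and the last is the right one. With many images placed freely on the page, position alone is probably not enough to identify a specific image.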

   However, I haven't been able to find any reference to the layer in the 
PdfImageObjects that I retrieve.

The reference to the layer should be imgObject.get(PdfName.OC);

I would really appreciate any pointers how to proceed with this.

I hope the above answers help you on the way.
Putting the images inside a layer is a good idea,
but please try the getImageCTM() first and let us know if it works as expected.
Feedback is always appreciated.
