Re: [iText-questions] Parsing PDF retrieving layers and images

1T3XT BVBA Sun, 07 Aug 2011 00:04:33 -0700

On 6/08/2011 19:11, Lukas Johansson wrote:

Hello,
I'm currently evaluating iTextPDF for a project and now I'm stuck with a problem.


Looking at the things you've tried, you've read the book well.
Now you need to know some advanced stuff.

My use case requires me to find the left image (PDF format) in the document, do some stuff with it, and then do the same thing with the right image. The PDF document is created by InDesign and the images do not have any special ID.

I don't know if you can add extra IDs to stream dictionaries containing images.

Maybe, maybe not. So let's look at other options.

At first I thought I would traverse all images and compare their file names to see if I had found the right one, but as I understand it, the file names are not preserved when the images are added to a PDF document.


No, file names are not preserved.

Moreover, some image types are converted to another image type before they are added to a PDF. For instance, a PNG will be converted to another type of image (what type? That depends on the tool used to create the PDF).

I then tried to use the image's metadata (which I can see as XML if I look at the file in a text editor) by adding a title attribute and checking against that, but I couldn't find out how to get hold of this metadata from a PdfObject/PdfImage.

If the image type is converted to another type of image, chances are that the XML has disappeared.

We'd have to see a PDF to make sure if the XML is still there.

If it is, you need to get the PRStream of the Image and get the bytes of that stream.
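
Once you have those bytes, checking for the title could look roughly like the sketch below. It uses only the JDK's XML parser and assumes the title ended up in a Dublin Core dc:title element, which is where XMP usually keeps it; your file may differ. The class and method names (XmpTitle, titleOf) are made up for illustration and are not part of iText. Getting the bytes is not shown: with iText 5 that would typically be PdfReader.getStreamBytes(stream), and the XMP packet often sits in a separate stream referenced by the Metadata key rather than in the image data itself.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not an iText class.
final class XmpTitle {
    // Dublin Core namespace used by XMP for dc:title.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /**
     * Returns the trimmed text of the first dc:title element in the given
     * XMP packet, or null if there is none. If the title has several
     * language alternatives, their texts are concatenated.
     */
    static String titleOf(byte[] xmpBytes) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // XMP never needs a DOCTYPE, so refuse one outright.
            factory.setFeature(
                "http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xmpBytes));
            NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
            if (titles.getLength() == 0) {
                return null;
            }
            return titles.item(0).getTextContent().trim();
        } catch (Exception e) {
            // Not well-formed XML, or not XML at all (e.g. raw image data).
            throw new IllegalArgumentException("Could not parse XMP packet", e);
        }
    }
}
```

You would then compare the returned value against the titles you set in InDesign, assuming they survived the export.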

I then placed the images in their own layers, called left and right, and tried either to traverse the layers to find each layer's image or to traverse all images and check which layer they belong to.

Depending on the tool that creates the PDF, the info about the layers can be:

[1] part of the content stream of the page
[2] an entry of the stream object of the image

If [1] is the case, then you'll have a lot of work to parse the content stream.
If [2] is the case, you can use stream.get(PdfName.OC); to find a reference to the Optional Content dictionary.
Once you have the Optional Content dictionary, you know what layer the image belongs to.
Note that [2] is preferred over [1] when creating PDFs with images that belong to a specific layer.

* Found the Image element in two ways
   1. int n = reader.getXrefSize();
       PdfObject object;
       PRStream stream;
       for (int i = 0; i < n; i++) {
           object = reader.getPdfObject(i);
           // Skip free entries and anything that is not a stream
           if (object == null || !object.isStream())
               continue;
           stream = (PRStream)object;
           // Keep only image XObjects
           if (PdfName.IMAGE.equals(stream.get(PdfName.SUBTYPE))) {
               PdfImageObject image = new PdfImageObject(stream);
           }
       }


That's the "dirty" way: you loop over ALL the objects in the PDF.

This way, you may even find images that aren't shown on any page in the PDF.

   2. Using PdfReaderContentParser as in the ExtractImages example.


I think this is the better way.

You're missing only one little piece of information (not mentioned in the book).
See http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/ImageRenderInfo.html

Currently, you retrieve the PdfImageObject with the getImage() method.
To find out its location, you should also use the getImageCTM() method.
CTM is short for Current Transformation Matrix.
Once you have a Matrix object, you can retrieve the X and Y translation:
float x = matrix.get(Matrix.I31);
float y = matrix.get(Matrix.I32);
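
With an x and y for each image, telling left from right comes down to comparing the x values. Below is a minimal sketch in plain Java, assuming you collect one (label, x, y) entry per image in your render listener. ImagePlacement and leftToRight are hypothetical names for illustration, not iText classes, and the label is whatever you use to keep track of an image. For an image that is not rotated or mirrored, x and y should be its lower-left corner in PDF user space, where x grows to the right.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical value class, not part of iText.
final class ImagePlacement {
    final String label; // any identifier you choose for the image
    final float x;      // from matrix.get(Matrix.I31)
    final float y;      // from matrix.get(Matrix.I32)

    ImagePlacement(String label, float x, float y) {
        this.label = label;
        this.x = x;
        this.y = y;
    }

    /** Returns a new list ordered by x, leftmost image first. */
    static List<ImagePlacement> leftToRight(List<ImagePlacement> placements) {
        List<ImagePlacement> sorted = new ArrayList<>(placements);
        sorted.sort(Comparator.comparingDouble((ImagePlacement p) -> p.x));
        return sorted;
    }
}
```

If a page holds exactly two images, the first entry of leftToRight(...) would be your left image and the second your right one. Pages with more images, or images at the same x, need an extra rule, for instance also comparing y.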

However, I haven't been able to find any reference to the layer in the PdfImageObjects that I retrieve.


The reference to the layer should be imgObject.get(PdfName.OC);

I would really appreciate any pointers how to proceed with this.


I hope the above answers help you on the way.
Putting the images inside a layer is a good idea, but please try getImageCTM() first and let us know if it works as expected.

Feedback is always appreciated.
