Dear,

I'm using iText to extract specific text from a PDF. However, with some PDFs the coordinates I specify are ignored: when I select a rectangle of only 1 PostScript point, a much larger block of text is extracted. I found that the problem occurs with PDFs exported from Excel by certain PDF generation tools (e.g. CutePDF): the whole table cell containing those coordinates gets extracted.

Attached to this e-mail is an example PDF file.
This is the code I use for the extraction:

import com.google.common.io.Closeables;
import com.google.common.io.Files;
import com.google.common.io.InputSupplier;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.parser.*;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;

public class itext {
    public static void main(String[] args) throws IOException {
        InputSupplier<? extends InputStream> pdf = Files.newInputStreamSupplier(new File("src/main/resources/test.pdf"));

        int pageNumber = 1;
        // Region of interest: a 1 x 1 point rectangle (lower-left and upper-right corners)
        int llx = 200;
        int lly = 776;
        int urx = 201;
        int ury = 777;
        Rectangle rect = new Rectangle(llx, lly, urx, ury);

        InputStream in = pdf.getInput();
        PdfReader reader = null;
        try {
            reader = new PdfReader(in);
            RenderFilter filter = new RegionTextRenderFilter(rect);
            TextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);

            StringWriter out = new StringWriter();
            out.write(PdfTextExtractor.getTextFromPage(reader, pageNumber, strategy));
            System.out.println(">" + out + "<");
        } finally {
            if (reader != null) {
                reader.close();
            }
            Closeables.closeQuietly(in);
        }
    }
}

What is going wrong here, and how can I force iText to stick to the specified coordinates? Switching to a different PDF generation tool is not an option, because that is beyond my control.
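
One possible explanation, offered as an assumption and not verified against the iText source: the region filter may accept or reject each text chunk as a whole, depending on whether the chunk's baseline crosses the rectangle. If a generator writes an entire table cell as a single chunk, a 1-point rectangle would then pull in the whole cell. The sketch below only illustrates that geometry with the JDK's Rectangle2D. The Chunk class, both method names, the example coordinates, and the uniform glyph width are hypothetical and are not iText API.

```java
import java.awt.geom.Rectangle2D;

public class RegionFilterSketch {

    /** Hypothetical stand-in for one text chunk: its string and a horizontal baseline. */
    public static final class Chunk {
        final String text;
        final double x1, y, x2;

        public Chunk(String text, double x1, double y, double x2) {
            this.text = text;
            this.x1 = x1;
            this.y = y;
            this.x2 = x2;
        }
    }

    /** Chunk-level test: true as soon as any part of the baseline crosses the region. */
    public static boolean baselineIntersects(Rectangle2D region, Chunk c) {
        return region.intersectsLine(c.x1, c.y, c.x2, c.y);
    }

    /**
     * Character-level test: splits the baseline into equal slices (assumes a
     * uniform glyph width, which real fonts rarely have) and keeps only the
     * characters whose slice crosses the region.
     */
    public static String charactersInside(Rectangle2D region, Chunk c) {
        StringBuilder sb = new StringBuilder();
        double width = (c.x2 - c.x1) / c.text.length();
        for (int i = 0; i < c.text.length(); i++) {
            double start = c.x1 + i * width;
            if (region.intersectsLine(start, c.y, start + width, c.y)) {
                sb.append(c.text.charAt(i));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same 1 x 1 point region as in the code above: (200, 776) to (201, 777).
        Rectangle2D region = new Rectangle2D.Double(200, 776, 1, 1);
        // Made-up chunk: a 16-character cell whose baseline runs from x=153 to x=313.
        Chunk cell = new Chunk("whole table cell", 153, 776.5, 313);

        System.out.println("chunk accepted as a whole: " + baselineIntersects(region, cell));
        System.out.println("characters inside region: >" + charactersInside(region, cell) + "<");
    }
}
```

If this assumption holds, a workaround would have to filter at character level rather than chunk level, along the lines of charactersInside above but using the real glyph positions instead of equal slices.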


Kind regards,
Cyrille Bartholomee

Attachment: test.pdf
Description: Adobe PDF document

_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions