Dear,
I'm using iText to extract specific text from a PDF. However, when extracting from some PDFs, the coordinates I specify are ignored: when I select a rectangle of only 1 PostScript point, a much larger text block is extracted. I found that the problem occurs with PDFs exported from Excel by some PDF generation tools (e.g. CutePDF): the whole table cell in which those coordinates lie gets extracted.
An example PDF file is attached to this e-mail.
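To make the symptom concrete, here is a small geometric sketch that does not use iText at all. It rests on an assumption that is not confirmed anywhere in this message: that the region filter accepts a whole text chunk as soon as the chunk's baseline touches the rectangle, and that the PDF tool writes each table cell as a single chunk. The class name, the baseline coordinates, and the equal-width character slices are all hypothetical, chosen only for illustration.

```java
import java.awt.geom.Rectangle2D;

public class RegionFilterSketch {

    // Chunk-level test (assumed behaviour): the whole chunk passes
    // if its baseline from (x1, y) to (x2, y) touches the region.
    public static boolean chunkAccepted(Rectangle2D region,
                                        double x1, double x2, double y) {
        return region.intersectsLine(x1, y, x2, y);
    }

    // Character-level test: cut the baseline into equal-width slices
    // (a simplification, real glyph widths differ) and keep only the
    // characters whose own slice touches the region.
    public static String charactersAccepted(Rectangle2D region, String text,
                                            double x1, double x2, double y) {
        double width = (x2 - x1) / text.length();
        StringBuilder kept = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            double start = x1 + i * width;
            if (region.intersectsLine(start, y, start + width, y)) {
                kept.append(text.charAt(i));
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        // Same 1 x 1 point rectangle as in the extraction code: (200, 776) to (201, 777).
        Rectangle2D region = new Rectangle2D.Double(200, 776, 1, 1);
        String cell = "ABCDEFGHIJ"; // hypothetical cell content written as one chunk

        // The chunk's baseline runs from x=155 to x=255, so it crosses the region
        // and the whole cell would be returned.
        System.out.println(chunkAccepted(region, 155, 255, 776.5)); // prints true

        // Testing each character separately keeps only the one under the rectangle.
        System.out.println(charactersAccepted(region, cell, 155, 255, 776.5)); // prints E
    }
}
```

If that assumption holds, the coordinates are not really ignored; the filter simply works at chunk granularity, and a finer-grained test per character would be needed to get only the text inside the rectangle.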
This is the code I use for the extraction:
import com.google.common.io.Closeables;
import com.google.common.io.Files;
import com.google.common.io.InputSupplier;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.parser.*;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;

public class itext {
    public static void main(String[] args) throws IOException {
        InputSupplier<? extends InputStream> pdf =
                Files.newInputStreamSupplier(new File("src/main/resources/test.pdf"));
        int pageNumber = 1;

        // Region of 1 x 1 point: lower-left (200, 776), upper-right (201, 777)
        int llx = 200;
        int lly = 776;
        int urx = 201;
        int ury = 777;
        Rectangle rect = new Rectangle(llx, lly, urx, ury);

        InputStream in = pdf.getInput();
        PdfReader reader = null;
        try {
            reader = new PdfReader(in);
            // Only text that passes the region filter reaches the extraction strategy
            RenderFilter filter = new RegionTextRenderFilter(rect);
            TextExtractionStrategy strategy = new FilteredTextRenderListener(
                    new LocationTextExtractionStrategy(), filter);
            StringWriter out = new StringWriter();
            out.write(PdfTextExtractor.getTextFromPage(reader, pageNumber, strategy));
            System.out.println(">" + out + "<");
        } finally {
            if (reader != null) {
                reader.close();
            }
            Closeables.closeQuietly(in);
        }
    }
}
What is going wrong here? And how can I force iText to stick to the correct coordinates?
Switching to a different PDF generation tool is not an option for me, because that is beyond my control.
Kind regards, Cyrille Bartholomee
test.pdf
Description: Adobe PDF document
