I'm using PdfTextExtractor to extract text from a specific region of
a PDF document with the code below:
PdfReader reader = new PdfReader(strPdfIn);
int iPages = reader.getNumberOfPages();
Rectangle rect = new Rectangle(10, 10, 30, 360);
RenderFilter filter = new RegionTextRenderFilter(rect);
TextExtractionStrategy strategy;
for (int i = 1; i <= iPages; i++)
{
    System.out.println("Getting Page:[" + i + "] of [" + iPages + "]");
    // Create a fresh strategy for each page; a reused
    // LocationTextExtractionStrategy keeps the text of earlier pages.
    strategy = new FilteredTextRenderListener(
            new LocationTextExtractionStrategy(), filter);
    String strExtract = PdfTextExtractor.getTextFromPage(reader, i, strategy);
    System.out.println("strExtract:[" + strExtract + "]");
}
The source PDF has 31,576 pages. Everything works fine until page 26,402,
where the following exception is thrown:
ExceptionConverter:
Completed...com.itextpdf.text.exceptions.InvalidPdfException: '>' not expected at file pointer 191980
    at com.itextpdf.text.pdf.PRTokeniser.throwError(PRTokeniser.java:205)
    at com.itextpdf.text.pdf.PRTokeniser.nextToken(PRTokeniser.java:358)
    at com.itextpdf.text.pdf.PdfContentParser.nextValidToken(PdfContentParser.java:196)
    at com.itextpdf.text.pdf.PdfContentParser.readPRObject(PdfContentParser.java:166)
    at com.itextpdf.text.pdf.PdfContentParser.parse(PdfContentParser.java:89)
    at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:365)
    at com.itextpdf.text.pdf.parser.PdfReaderContentParser.processContent(PdfReaderContentParser.java:79)
    at com.itextpdf.text.pdf.parser.PdfTextExtractor.getTextFromPage(PdfTextExtractor.java:73)
I isolated the error to a single page, which consists of a TIFF image with
no visible text. The document opens in Acrobat without errors. I extracted
the page that triggers the exception using Acrobat Pro X, and the problem is
still present in the extracted file. Another PDF document, which contains
text along with a TIFF image, fails with the exact same error.
It is my understanding that the PDF was generated by HP Exstream.
Any suggestions would be greatly appreciated.
Bill