Text Extraction Does Not Extract Content Beyond First Page
----------------------------------------------------------
                 Key: PDFBOX-413
                 URL: https://issues.apache.org/jira/browse/PDFBOX-413
             Project: PDFBox
          Issue Type: Bug
         Environment: Ubuntu, OpenJDK 6
            Reporter: alvin

This is my attempt to extract plain text from a PDF using PDFBox:

    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setStartPage(1);
    stripper.setEndPage(5);
    LucenePDFDocument document = new LucenePDFDocument();
    Document luceneDocument = document.convertDocument(file);
    System.out.println("CONTENTS: " + luceneDocument.get("contents"));

----------------------------------------------------------
This is the result I get, and it never goes beyond page 1:
----------------------------------------------------------
Document<stored/uncompressed<path:/home/alvin/Desktop/google.pdf> stored/uncompressed<url:/home/alvin/Desktop/google.pdf> stored/uncompressed,indexed<modified:20090130112759> indexed<uid: Web Search Engine Sergey Brin and Lawrence Page Computer Science Department, Stanford University, Stanford, CA 94305, USA ser...@cs.stanford.edu and p...@cs.stanford.edu Abstract In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The proto>>

Is this a bug?
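
To make the expectation concrete, here is a minimal sketch of what a start page of 1 and an end page of 5 should select. It is not PDFBox code: PageRangeSketch and selectPages are hypothetical names, the per-page text is passed in as a plain list, and clamping the range to the pages that exist is an assumption of the sketch. Note also that in the snippet as posted the stripper is configured but never handed to LucenePDFDocument, so its page range may simply not apply to convertDocument; the range would presumably only matter for text obtained from the stripper itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a page-range text extractor; not PDFBox code. */
final class PageRangeSketch {

    /**
     * Returns the text of pages startPage..endPage (1-based, inclusive),
     * clamped to the pages that actually exist (an assumption of this sketch).
     */
    static List<String> selectPages(List<String> pageTexts, int startPage, int endPage) {
        List<String> selected = new ArrayList<String>();
        int first = Math.max(startPage, 1);
        int last = Math.min(endPage, pageTexts.size());
        for (int page = first; page <= last; page++) {
            // Convert the 1-based page number to a 0-based list index.
            selected.add(pageTexts.get(page - 1));
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("page one", "page two", "page three");
        // Pages 1..5 of a 3-page document: every existing page is selected.
        System.out.println(selectPages(pages, 1, 5));
        // Pages 2..2: only the second page is selected.
        System.out.println(selectPages(pages, 2, 2));
    }
}
```

Under this reading, a range of 1 to 5 should yield the text of every page up to the fifth, which is what the report expected and did not get.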