Text Extraction Does Not Extract Content Beyond First Page
----------------------------------------------------------

                 Key: PDFBOX-413
                 URL: https://issues.apache.org/jira/browse/PDFBOX-413
             Project: PDFBox
          Issue Type: Bug
         Environment: Ubuntu, OpenJDK 6
            Reporter: alvin


This is my attempt to extract plain text from a PDF using PDFBox:

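        // Configure a text stripper for pages 1 through 5.
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setStartPage(1);
        stripper.setEndPage(5);
        // Convert the PDF to a Lucene document and print its "contents" field.
        // Note: the stripper configured above is never passed to the converter.
        LucenePDFDocument document = new LucenePDFDocument();
        Document luceneDocument = document.convertDocument(file);
        System.out.println("CONTENTS: " + luceneDocument.get("contents"));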

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
This is the result I get, and it never goes beyond page 1:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Document<stored/uncompressed<path:/home/alvin/Desktop/google.pdf> 
stored/uncompressed<url:/home/alvin/Desktop/google.pdf> 
stored/uncompressed,indexed<modified:20090130112759> indexed<uid:
Web Search Engine
Sergey Brin and Lawrence Page 
Computer Science Department,
Stanford University, Stanford, CA 94305, USA
ser...@cs.stanford.edu and p...@cs.stanford.edu 
Abstract 
In this paper, we present Google, a prototype of a large-scale search engine 
which makes heavy
use of the structure present in hypertext. Google is designed to crawl and 
index the Web efficiently
and produce much more satisfying search results than existing systems. The 
proto>>

Is this a bug?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.