Hi,

 I have a serious performance problem when extracting text from PDF files with PDFBox.

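 Here is the code (try/catch blocks omitted):

 // Open the PDF and parse it into an in-memory document.
 File file = new File("test.pdf");
 FileInputStream reader = new FileInputStream(file);

 PDFParser parser = new PDFParser(reader);
 parser.parse();
 PDDocument pdDoc = parser.getPDDocument();

 // Extract the text of the whole document.
 PDFTextStripper stripper = new PDFTextStripper();
 String pdftext = stripper.getText(pdDoc);

 // Release the document and the underlying input stream.
 pdDoc.close();
 reader.close();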

 Now, the whole process takes:
 - 37.4 s for a 74 kB file (parsing: 5.3 s)
 - 156.7 s for a 150 kB file (parsing: 11.0 s)
 - 157.8 s for a 270 kB file (parsing: 34.3 s)
 - 313.3 s for a 151 kB file (parsing: 5.9 s)
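
 For reference, a split like the one above (parsing vs. the whole process) could be measured with something like the following sketch. The class name PhaseTimer and its methods are hypothetical, and the Thread.sleep calls are placeholders standing in for parser.parse() and stripper.getText(pdDoc); only java.util and System.nanoTime are used.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical helper: records how long each named phase takes.
class PhaseTimer {
    // Elapsed milliseconds per phase label, in the order the phases first ran.
    private final Map<String, Long> elapsedMillis = new LinkedHashMap<>();

    // Runs the phase, records its wall-clock duration, and returns its result.
    public <T> T time(String label, Callable<T> phase) throws Exception {
        long start = System.nanoTime();
        try {
            return phase.call();
        } finally {
            long millis = (System.nanoTime() - start) / 1_000_000;
            // Accumulate, so a phase that runs several times is summed up.
            elapsedMillis.merge(label, millis, Long::sum);
        }
    }

    // Milliseconds recorded for a phase, or 0 if it never ran.
    public long millisFor(String label) {
        return elapsedMillis.getOrDefault(label, 0L);
    }

    @Override
    public String toString() {
        return elapsedMillis.toString();
    }

    public static void main(String[] args) throws Exception {
        PhaseTimer timer = new PhaseTimer();
        // Placeholder for parser.parse() / parser.getPDDocument().
        String parsed = timer.time("parse", () -> {
            Thread.sleep(20);
            return "document";
        });
        // Placeholder for stripper.getText(pdDoc).
        String text = timer.time("extract", () -> {
            Thread.sleep(50);
            return parsed + " text";
        });
        System.out.println(text + " " + timer);
    }
}
```

 Timing the two phases separately makes it easier to tell whether the parser or the text stripper dominates for a given file.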

I don't understand these numbers. Is this performance normal for PDFBox? Or is the cause my system (Windows 2000, Pentium III 700 MHz, 512 MB RAM), my code, or perhaps the PDF documents themselves (text only, except the last one, which contains some UML diagrams)?

I am currently writing a knowledge base system and had planned to do real-time text extraction and indexing (using Lucene). That is not realistic given these extraction times.
Perhaps it would be better to run the extraction and indexing once every 24 hours, processing all the documents added during that period.
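
The once-a-day variant could look roughly like the sketch below: newly added documents are only queued, and a scheduled task drains the queue every 24 hours. The class NightlyIndexer and its method names are hypothetical, and the Consumer passed to the constructor is a placeholder for the actual PDFBox extraction plus Lucene indexing, which is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch: queue documents as they arrive, process them in one batch.
class NightlyIndexer {
    // Paths of documents added since the last batch run.
    private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();
    // Placeholder for "extract text and add it to the index".
    private final Consumer<String> extractAndIndex;

    NightlyIndexer(Consumer<String> extractAndIndex) {
        this.extractAndIndex = extractAndIndex;
    }

    // Cheap call made when a document is added; no extraction happens here.
    void documentAdded(String path) {
        pending.add(path);
    }

    // Processes everything queued so far and returns the paths handled.
    List<String> runBatch() {
        List<String> processed = new ArrayList<>();
        String path;
        while ((path = pending.poll()) != null) {
            extractAndIndex.accept(path);
            processed.add(path);
        }
        return processed;
    }

    // Runs the batch every 24 hours. Note that scheduleAtFixedRate stops
    // rescheduling the task if one run throws an exception.
    ScheduledExecutorService startDaily() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::runBatch, 24, 24, TimeUnit.HOURS);
        return scheduler;
    }

    public static void main(String[] args) {
        NightlyIndexer indexer = new NightlyIndexer(
                path -> System.out.println("would extract and index " + path));
        indexer.documentAdded("test.pdf");
        // Run one batch directly instead of waiting for the scheduler.
        System.out.println("processed: " + indexer.runBatch());
    }
}
```

With this arrangement, adding a document stays fast for the user, and the slow extraction only affects how fresh the index is.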


 TIA for any comments/suggestions.

--
        Miroslaw Milewski

