The first thing I would do is wrap the FileInputStream in a BufferedInputStream.

Change:

> FileInputStream reader = new FileInputStream(file);

To:

InputStream reader = new BufferedInputStream(new FileInputStream(file));

Reading through a buffer usually gives a significant boost, particularly as the size of the file grows.

Try that first, and then rebenchmark.

Cheers,

Paul Smith

> -----Original Message-----
> From: Miroslaw Milewski [mailto:[EMAIL PROTECTED]
> Sent: Thursday, July 29, 2004 7:24 AM
> To: [EMAIL PROTECTED]
> Subject: pdfbox performance.
>
>
> Hi,
>
> I have a serious performance problem while extracting text from pdf.
>
> Here is the code (w/o try/catch blocks):
>
> File file = new File("test.pdf");
> FileInputStream reader = new FileInputStream(file);
>
> PDFParser parser = new PDFParser(reader);
> parser.parse();
> PDDocument pdDoc = parser.getPDDocument();
>
> PDFTextStripper stripper = new PDFTextStripper();
> String pdftext = stripper.getText(pdDoc);
>
> pdDoc.close();
>
> Now, the whole process takes:
> - 37,4 sec w. a 74 kB file (parsing took 5,3 sec.)
> - 156,7 sec w. a 150 kB file (parsing: 11,0 sec.)
> - 157,8 sec w. a 270 kB file (parsing: 34,3 sec.)
> - 313,3 sec w. a 151 kB file (parsing: 5,9 sec.)
>
> Now, I can't really get the point here. Is this performance standard
> for pdfbox? Or is it my system (win2k, PIII 700, 512 RAM), or the code,
> or maybe the pdf docs (text only, the last one with some UML diags.)
>
> I am writing a knowledge base system at the moment, and planned to do
> real-time text extraction and indexing (using Lucene.) But this is not
> realistic, considering the extraction time.
> Then maybe it is a better idea to run the extraction and indexing once
> every 24 h, processing all the documents added during that period.
>
> TIA for any comments/suggestions.
>
> --
> Miroslaw Milewski
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
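
P.S. To check how much the buffering suggestion matters on a given machine before touching the real code, a rough sketch along these lines might help. It does not call PDFBox at all. It only times single-byte reads over a scratch file, with and without a BufferedInputStream, on the assumption (not verified here) that the parser issues many small reads. The class name BufferedReadSketch, the helper methods, and the 270 kB scratch file are made up for illustration. Note also that in the timings quoted above, parsing is only a fraction of the total, so this can at best speed up the parse step, not the text stripping.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper class, not part of PDFBox.
class BufferedReadSketch {

    // Reads the stream one byte at a time (the worst case for an
    // unbuffered stream) and returns the number of bytes read.
    static long countBytes(InputStream in) throws IOException {
        long count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    // Times one single-byte read pass over the file, optionally wrapped
    // in a BufferedInputStream. Returns the elapsed time in nanoseconds.
    static long timeRead(File file, boolean buffered) throws IOException {
        InputStream in = new FileInputStream(file);
        if (buffered) {
            in = new BufferedInputStream(in);
        }
        long start = System.nanoTime();
        try {
            countBytes(in);
        } finally {
            in.close();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        // Scratch file standing in for a PDF; the size is an arbitrary choice.
        File file = File.createTempFile("buffer-sketch", ".bin");
        file.deleteOnExit();
        OutputStream out = new FileOutputStream(file);
        try {
            out.write(new byte[270 * 1024]);
        } finally {
            out.close();
        }

        System.out.println("unbuffered ms: " + timeRead(file, false) / 1000000);
        System.out.println("buffered ms:   " + timeRead(file, true) / 1000000);
    }
}
```

If the two numbers come out close, the stream is probably not the bottleneck and the time is going elsewhere, most likely into the text stripping step.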
