On Wednesday 28 July 2004 15:44, Paul Smith wrote:
> The first thing that I would do is wrap the FileInputStream with a
> BufferedInputStream.
....
> You get a significant boost reading in from a buffer, particularly as the
> size of the file grows.

Benchmarking is good; whether there's any significant performance
difference depends on how the app reads data from the stream. Most
high-performance apps read straight into a local buffer, in which case
BufferedInputStream offers nothing more than the buffer overhead... :-)
You only get a nice performance improvement if most reads are done with
the single-byte read method (read()), though, so there's always a chance,
I guess.

-+ Tatu +-

> Try that first, and then rebenchmark.
> Cheers
> Paul Smith
>
> > -----Original Message-----
> > From: Miroslaw Milewski [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, July 29, 2004 7:24 AM
> > To: [EMAIL PROTECTED]
> > Subject: pdfbox performance.
> >
> >
> > Hi,
> >
> > I have a serious performance problem while extracting text from pdf.
> >
> > Here is the code (w/o try/catch blocks):
> >
> > File file = new File("test.pdf");
> > FileInputStream reader = new FileInputStream(file);
> >
> > PDFParser parser = new PDFParser(reader);
> > parser.parse();
> > PDDocument pdDoc = parser.getPDDocument();
> >
> > PDFTextStripper stripper = new PDFTextStripper();
> > String pdftext = stripper.getText(pdDoc);
> >
> > pdDoc.close();
> >
> > Now, the whole process takes:
> > - 37,4 sec w. a 74 kB file (parsing took 5,3 sec.)
> > - 156,7 sec w. a 150 kB file (parsing: 11,0 sec.)
> > - 157,8 sec w. a 270 kB file (parsing: 34,3 sec.)
> > - 313,3 sec w. a 151 kB file (parsing: 5,9 sec.)
> >
> > Now, I can't really get the point here. Is this performance standard
> > for pdfbox? Or is it my system (win2k, PIII 700, 512 RAM), or the code,
> > or maybe the pdf docs (text only, the last one with some UML diags.)
> >
> > I am writing a knowledge base system at the moment, and planned to do
> > real-time text extraction and indexing (using Lucene.) But this is not
> > realistic, considering the extraction time.
> > Then maybe it is a better idea to run the extraction and indexing once
> > every 24 h, processing all the documents added during that period.
> >
> > TIA for any comments/suggestions.
> >
> > --
> > Miroslaw Milewski
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
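
The point made in the reply about read patterns can be checked with a small benchmark. What follows is a minimal sketch and not code from this thread: the class name ReadBench, its two helper methods, and the 256 kB temp file are hypothetical, and it uses try-with-resources, which is newer than the Java of 2004. It times single-byte read() calls on a raw FileInputStream, the same calls through a BufferedInputStream, and bulk reads into a local byte array. It says nothing about how PDFParser itself reads its input, and a single unwarmed run gives only a rough indication.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;

// Hypothetical helper class for illustration only.
class ReadBench {

    /** Consumes the stream one byte at a time via read(); returns the byte count. */
    static long readSingleBytes(InputStream in) throws IOException {
        long count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    /** Consumes the stream in chunks into a local buffer; bufferSize must be > 0. */
    static long readBulk(InputStream in, int bufferSize) throws IOException {
        byte[] buf = new byte[bufferSize];
        long count = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            count += n;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Assumed test input: a 256 kB temp file filled with zero bytes.
        File file = File.createTempFile("readbench", ".bin");
        file.deleteOnExit();
        Files.write(file.toPath(), new byte[256 * 1024]);

        long t0 = System.nanoTime();
        try (InputStream in = new FileInputStream(file)) {
            readSingleBytes(in); // every read() goes to the file directly
        }
        long t1 = System.nanoTime();
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            readSingleBytes(in); // read() is served from the in-memory buffer
        }
        long t2 = System.nanoTime();
        try (InputStream in = new FileInputStream(file)) {
            readBulk(in, 8192); // caller-side buffer, no wrapper needed
        }
        long t3 = System.nanoTime();

        System.out.printf("raw read():      %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("buffered read(): %d ms%n", (t2 - t1) / 1_000_000);
        System.out.printf("bulk read(buf):  %d ms%n", (t3 - t2) / 1_000_000);
    }
}
```

If the first timing is far larger than the other two while the second and third are close, the reader is dominated by single-byte reads and the wrapper is worth having; if the code under test already reads in bulk, the wrapper should make little difference.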
