The first thing that I would do is wrap the FileInputStream with a
BufferedInputStream.
Change: 
>   FileInputStream reader = new FileInputStream(file);
To:
InputStream reader = new BufferedInputStream(new FileInputStream(file));
You get a significant boost reading in from a buffer, particularly as the
size of the file grows.
Try that first, and then rebenchmark.
Cheers
Paul Smith
> -----Original Message-----
> From: Miroslaw Milewski [mailto:[EMAIL PROTECTED]
> Sent: Thursday, July 29, 2004 7:24 AM
> To: [EMAIL PROTECTED]
> Subject: pdfbox performance.
> 
> 
>   Hi,
> 
>   I have a serious performance problem while extracting text from pdf.
> 
>   Here is the code (w/o try/catch blocks):
> 
>   File file = new File("test.pdf");
>   FileInputStream reader = new FileInputStream(file);
> 
>   PDFParser parser = new PDFParser(reader);
>   parser.parse();
>   PDDocument pdDoc = parser.getPDDocument();
> 
>   PDFTextStripper stripper = new PDFTextStripper();
>   String pdftext = stripper.getText(pdDoc);
> 
>   pdDoc.close();
> 
>   Now, the whole process takes:
>   - 37,4 sec w. a 74 kB file (parsing took 5,3 sec.)
>   - 156,7 sec w. a 150 kB file (parsing: 11,0 sec.)
>   - 157,8 sec w. a 270 kB file (parsing: 34,3 sec.)
>   - 313,3 sec w. a 151 kB file (parsing: 5,9 sec.)
> 
>   Now, I can't really get the point here. Is this performance standard
> for pdfbox? Or is it my system (win2k, PIII 700, 512 RAM), or the code,
> or maybe the pdf docs (text only, the last one with some UML diags.)
> 
>   I am writing a knowledge base system at the moment, and planned to do
> real-time text extraction and indexing (using Lucene.) But this is not
> realistic, considering the extraction thime.
>   Then maybe it is a better idea to run the extraction and indexing once
> every 24 h, processing all the documents added during that period.
> 
>   TIA for any comments/suggestions.
> 
> --
>       Miroslaw Milewski
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to