Re: pdfbox performance.

Miroslaw Milewski Wed, 28 Jul 2004 16:15:19 -0700

Paul Smith wrote:

 > The first thing that I would do is wrap the FileInputStream with a
 > BufferedInputStream.
 > Change:
 > > FileInputStream reader = new FileInputStream(file);
 > To:
 > InputStream reader = new BufferedInputStream(new
 > FileInputStream(file));
 > You get a significant boost reading in from a buffer, particularly as
 > the size of the file grows. Try that first, and then rebenchmark.

 I tested both, here is the code:

File file = new File("test.pdf");
InputStream reader = null;

for(int i=1; i<=6; i++) {

  long step01 = Calendar.getInstance().getTimeInMillis();
  String stream = null;
        
  if(i%2 == 0) {
    reader = new BufferedInputStream(new FileInputStream(file));
      stream = "buffered";
  }
  else {
    reader = new FileInputStream(file);
    stream = "no buffer";
  }

  PDFParser parser = null;
  PDDocument pdDoc = null;

  parser = new PDFParser(reader);
  parser.parse();
  pdDoc = parser.getPDDocument();

  long step02 = Calendar.getInstance().getTimeInMillis();

  PDFTextStripper stripper = new PDFTextStripper();
  tring pdftext = stripper.getText(pdDoc);

  long step03 = Calendar.getInstance().getTimeInMillis();

  pdDoc.close();

  long end = Calendar.getInstance().getTimeInMillis();

  System.out.println("iteration: " + i + " - " + stream);
  System.out.println("start: " + start);
  System.out.println("step01: " + (step01-start));
  System.out.println("step02: " + (step02-start));
  System.out.println("step03: " + (step03-start));
  System.out.println("end: " + (end-start));
}

And below are the benchmarks for buffered and unbuffered readers. The difference is not stunning. It seems to get better with time, but this is prably due to some VM optimisation. And I'll extract the text only once :-).

file: 9kB, text only;

iteration: 1 - no buffer
step01: 0; step02: 1492; step03: 13850; end: 13880

iteration: 2 - buffered
step01: 0; step02: 912; step03: 10245; end: 10265

iteration: 3 - no buffer
step01: 0; step02: 951 ;step03: 9924; end: 9944

iteration: 4 - buffered
step01: 0; step02: 842; step03: 10075; end: 10105

iteration: 5 - no buffer
step01: 0; step02: 831; step03: 9934; end: 9954

iteration: 6 - buffered
step01: 0; step02: 932; step03: 9944; end: 9965


file: 74 kB; text only

iteration: 1 - no buffer
step01: 0; step02: 4918; step03: 33959; end: 33989

iteration: 2 - buffered
step01: 0; step02: 4367; step03: 32367; end: 32407

iteration: 3 - no buffer
step01: 0; step02: 4306; step03: 30995; end: 31025

iteration: 4 - buffered
step01: 0; step02: 4296; step03: 30734; end: 30764

iteration: 5 - no buffer
step01: 0; step02: 4266; step03: 30754; end: 30784

iteration: 6 - buffered
step01: 0; step02: 4256; step03: 30634; end: 30664


file: 270 kB, text only

iteration: 1 - no buffer
step01: 0; step02: 30634; step03: 142225; end: 142265

iteration: 2 - buffered
step01: 0; step02: 29893; step03: 135354; end: 135394

iteration: 3 - no buffer
step01: 0; step02: 29553; step03: 134654; end: 134694

iteration: 4 - buffered
step01: 0; step02: 29613; step03: 134944; end: 134984

iteration: 5 - no buffer
step01: 0; step02: 29543; step03: 139070; end: 139110

iteration: 6 - buffered
step01: 0; step02: 32427; step03: 150457; end: 150487

Anyway, I suppose I made a wrong assumption while designing my app. I don't think I can get a performance boost of 90% or so. Thus the documents (at least the .pdfs) won't be extracted and indexed at the time of adding them to the knowledge base. Since I also have a db involved, I can keep the basic data there, and extract and index in the meantime - most likely using a different thread.

 thx,
--
        Miroslaw Milewski


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: pdfbox performance.

Reply via email to