Paul Smith wrote:
> The first thing that I would do is wrap the FileInputStream with a
> BufferedInputStream.
> Change:
>
> FileInputStream reader = new FileInputStream(file);
> To:
> InputStream reader = new BufferedInputStream(new FileInputStream(file));
> You get a significant boost reading in from a buffer, particularly as
> the size of the file grows. Try that first, and then rebenchmark.

I tested both, here is the code:

    File file = new File("test.pdf");
    InputStream reader = null;

    for (int i = 1; i <= 6; i++) {
        // Reference time for this iteration; every step is reported relative to it.
        long start = Calendar.getInstance().getTimeInMillis();
        long step01 = Calendar.getInstance().getTimeInMillis();
        String stream = null;

        // Even iterations use a buffered stream, odd iterations the raw stream.
        if (i % 2 == 0) {
            reader = new BufferedInputStream(new FileInputStream(file));
            stream = "buffered";
        }
        else {
            reader = new FileInputStream(file);
            stream = "no buffer";
        }

        // Parse the PDF into a document model.
        PDFParser parser = new PDFParser(reader);
        parser.parse();
        PDDocument pdDoc = parser.getPDDocument();
        long step02 = Calendar.getInstance().getTimeInMillis();

        // Extract the text from the parsed document.
        PDFTextStripper stripper = new PDFTextStripper();
        String pdftext = stripper.getText(pdDoc);
        long step03 = Calendar.getInstance().getTimeInMillis();

        pdDoc.close();
        long end = Calendar.getInstance().getTimeInMillis();

        System.out.println("iteration: " + i + " - " + stream);
        System.out.println("start: " + start);
        System.out.println("step01: " + (step01-start));
        System.out.println("step02: " + (step02-start));
        System.out.println("step03: " + (step03-start));
        System.out.println("end: " + (end-start));
    }

Below are the benchmarks for the buffered and unbuffered readers; all times are in milliseconds, relative to the start of each iteration. The difference is not stunning. It seems to get better with time, but this is probably due to some VM optimisation. And I'll extract the text only once :-).

file: 9 kB, text only
iteration: 1 - no buffer step01: 0; step02: 1492; step03: 13850; end: 13880
iteration: 2 - buffered step01: 0; step02: 912; step03: 10245; end: 10265
iteration: 3 - no buffer step01: 0; step02: 951; step03: 9924; end: 9944
iteration: 4 - buffered step01: 0; step02: 842; step03: 10075; end: 10105
iteration: 5 - no buffer step01: 0; step02: 831; step03: 9934; end: 9954
iteration: 6 - buffered step01: 0; step02: 932; step03: 9944; end: 9965
file: 74 kB, text only
iteration: 1 - no buffer step01: 0; step02: 4918; step03: 33959; end: 33989
iteration: 2 - buffered step01: 0; step02: 4367; step03: 32367; end: 32407
iteration: 3 - no buffer step01: 0; step02: 4306; step03: 30995; end: 31025
iteration: 4 - buffered step01: 0; step02: 4296; step03: 30734; end: 30764
iteration: 5 - no buffer step01: 0; step02: 4266; step03: 30754; end: 30784
iteration: 6 - buffered step01: 0; step02: 4256; step03: 30634; end: 30664
file: 270 kB, text only
iteration: 1 - no buffer step01: 0; step02: 30634; step03: 142225; end: 142265
iteration: 2 - buffered step01: 0; step02: 29893; step03: 135354; end: 135394
iteration: 3 - no buffer step01: 0; step02: 29553; step03: 134654; end: 134694
iteration: 4 - buffered step01: 0; step02: 29613; step03: 134944; end: 134984
iteration: 5 - no buffer step01: 0; step02: 29543; step03: 139070; end: 139110
iteration: 6 - buffered step01: 0; step02: 32427; step03: 150457; end: 150487
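
In the numbers above, buffering barely changes anything, which suggests that raw file I/O is a small share of the total and that parsing and text extraction dominate. One way to check this is to time the reads alone, without the parser. The following is only a rough sketch under stated assumptions: the class and method names are made up for illustration, a temporary 256 kB file stands in for a real PDF, and it reads one byte at a time, which may not match how PDFParser actually consumes the stream.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of PDFBox: times raw reads only.
class StreamReadBenchmark {

    // Reads the stream one byte at a time and returns the number of bytes read.
    static long drain(InputStream in) throws IOException {
        long count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    // Times one full read of the file, in nanoseconds.
    static long timeRead(Path file, boolean buffered) throws IOException {
        long start = System.nanoTime();
        try (InputStream raw = new FileInputStream(file.toFile());
             InputStream in = buffered ? new BufferedInputStream(raw) : raw) {
            drain(in);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        // Assumed test data: a temporary 256 kB file instead of a real PDF.
        Path file = Files.createTempFile("bench", ".bin");
        Files.write(file, new byte[256 * 1024]);
        try {
            // Same alternation as in the benchmark above: even = buffered.
            for (int i = 1; i <= 6; i++) {
                boolean buffered = (i % 2 == 0);
                long nanos = timeRead(file, buffered);
                System.out.println("iteration: " + i + " - "
                        + (buffered ? "buffered" : "no buffer")
                        + " read: " + (nanos / 1_000) + " us");
            }
        } finally {
            Files.delete(file);
        }
    }
}
```

If the read times from such a test are tiny compared with step02 and step03, the stream type is not the bottleneck and the time goes into the parser and the text stripper.
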
Anyway, I suppose I made a wrong assumption while designing my app. I don't think I can get a performance boost of 90% or so. Thus the documents (at least the .pdfs) won't be extracted and indexed at the time of adding them to the knowledge base.
Since I also have a db involved, I can store the basic data there right away, and extract and index the text afterwards, most likely in a separate thread.
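
A minimal sketch of that idea, assuming a single worker thread fed through an ExecutorService. Everything here is hypothetical: the BackgroundIndexer class, the in-memory map standing in for the real index, and the extractor function standing in for the PDF text extraction.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical sketch: documents are registered immediately, text is indexed later.
class BackgroundIndexer {
    // Stand-in for the real full-text index.
    private final Map<String, String> index = new ConcurrentHashMap<>();
    // One worker thread, so slow extractions queue up instead of competing.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    // Assumed extractor: maps a document id to its text.
    private final Function<String, String> extractor;

    BackgroundIndexer(Function<String, String> extractor) {
        this.extractor = extractor;
    }

    // Returns immediately; extraction and indexing run on the worker thread.
    Future<?> submit(String documentId) {
        return worker.submit(() -> {
            index.put(documentId, extractor.apply(documentId));
        });
    }

    // Returns the indexed text, or null if extraction has not finished yet.
    String indexedText(String documentId) {
        return index.get(documentId);
    }

    // Stops accepting work and waits for queued extractions to finish.
    void shutdown() throws InterruptedException {
        worker.shutdown();
        worker.awaitTermination(1, TimeUnit.MINUTES);
    }

    public static void main(String[] args) throws Exception {
        // Dummy extractor for the demo; a real one would run the PDF text extraction.
        BackgroundIndexer indexer = new BackgroundIndexer(id -> "text of " + id);
        Future<?> pending = indexer.submit("test.pdf");
        pending.get(); // wait here only so the demo can print the result
        System.out.println(indexer.indexedText("test.pdf")); // prints: text of test.pdf
        indexer.shutdown();
    }
}
```

With this shape, adding a document only costs the db insert plus a queue submission, and a search simply does not find the text until the worker has caught up.
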
thx,
--
Miroslaw Milewski
