Lucene (I am using 3.5) seems to be caching field values for documents after 
they have been retrieved, and I am hoping someone can provide more information 
on how and where exactly the field values are stored.

The table below lists the times (in milliseconds) needed to retrieve a single 
stored value from each document in the set of documents matching a particular 
query. Results are shown for three queries (A, B, and C), each submitted 
multiple times. The first time each query is submitted, the time to retrieve 
its matching document values is considerably longer than on any later 
submission.

 1) search A    nDocs =   489    time =  1342
 2) search A    nDocs =   489    time =   811
 3) search B    nDocs = 47038    time = 76658
 4) search B    nDocs = 47038    time =  1062
 5) search C    nDocs =  5256    time = 22741
 6) search C    nDocs =  5256    time =   578
 7) search A    nDocs =   489    time =   515
 8) search A    nDocs =   489    time =   514
 9) search B    nDocs = 47038    time =  1000
10) search B    nDocs = 47038    time =   967
11) search C    nDocs =  5256    time =   563
12) search C    nDocs =  5256    time =   562


Whatever is being cached is available across separate processes, so 
presumably it resides somewhere in the file system (and/or virtual memory). 
I have seen the same behavior when retrieving TermFreqVector information.
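
One way to test the file-system guess without involving Lucene at all would 
be something like the sketch below. It reads a file through a plain 
java.io stream several times and prints the time for each pass. The class 
name PageCacheProbe and the method timeRead are made up for illustration; 
the file to read is passed on the command line and is assumed to be a large 
file from the index directory that has not been read recently. If the first 
pass is much slower than the later ones, and a second run in a new process 
is fast from the start, that would point to the operating system's file 
cache rather than anything inside Lucene. This is only a rough probe, not a 
proper benchmark.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class PageCacheProbe {

    // Reads the whole file through a plain stream and returns
    // { bytesRead, elapsedMillis }. No Lucene classes are involved,
    // so any speed-up on later passes cannot come from Lucene itself.
    static long[] timeRead(Path file) throws IOException {
        byte[] buf = new byte[1 << 16];
        long total = 0;
        long start = System.nanoTime();
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1000000L;
        return new long[] { total, elapsedMs };
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.out.println("usage: java PageCacheProbe <file>");
            return;
        }
        // Assumption: args[0] names a large file that is not already cached.
        Path file = Paths.get(args[0]);
        for (int pass = 1; pass <= 3; pass++) {
            long[] r = timeRead(file);
            System.out.println("pass " + pass + "  bytes = " + r[0]
                    + "  time = " + r[1] + " ms");
        }
    }
}
```

Note that a file that was just written or just read will already be in the 
cache, so the first pass is only meaningful on a file that has been left 
untouched for a while (or after a reboot).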

Any additional insight is appreciated!

Thanks,
Stuart


__________________________________________________
Stuart Rose
Senior Research Engineer
Pacific Northwest National Laboratory
