The only identifier Lucene assigns on its own is the int docID, during indexing. Retrieving a previously stored document by docID is an O(1) operation, but it involves a disk seek, which can be very costly when the page is not in the OS's IO cache. Lucene does not do any caching itself; it relies on the OS instead.
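To make the "O(1) plus a seek" point concrete, here is a minimal sketch of the general idea: an in-memory array maps each int docID to a file offset, and a lookup is one array access followed by one seek. This is not Lucene's actual stored fields format, which is more involved; the class and method names (ToyStoredFields, add, get) are made up for illustration only.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Arrays;

/** Toy append-only document store (hypothetical, NOT Lucene's on-disk format). */
final class ToyStoredFields implements AutoCloseable {
    private final RandomAccessFile data;
    // offsets[docID] = byte position where that document's record starts
    private long[] offsets = new long[16];
    private int docCount = 0;

    ToyStoredFields(Path file) {
        try {
            this.data = new RandomAccessFile(file.toFile(), "rw");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Appends a document and returns the sequential int docID assigned to it. */
    int add(String doc) {
        try {
            byte[] bytes = doc.getBytes(StandardCharsets.UTF_8);
            long start = data.length();
            data.seek(start);
            data.writeInt(bytes.length); // length prefix, then the payload
            data.write(bytes);
            if (docCount == offsets.length) {
                offsets = Arrays.copyOf(offsets, docCount * 2);
            }
            offsets[docCount] = start;
            return docCount++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** One array lookup (O(1)), then one seek; the seek is the costly part on a cold cache. */
    String get(int docID) {
        if (docID < 0 || docID >= docCount) {
            throw new IllegalArgumentException("no such docID: " + docID);
        }
        try {
            data.seek(offsets[docID]);
            byte[] bytes = new byte[data.readInt()];
            data.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            data.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch, store.add("some text") returns 0 for the first document, 1 for the second, and so on, and store.get(1) reads the second document back. Note that the sketch does no caching of its own either: whether the seek hits disk or memory is left entirely to the OS.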
Have a look at the current default stored fields codec format: lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41StoredFieldsFormat for details.

Mike McCandless

http://blog.mikemccandless.com

On Wed, Feb 12, 2014 at 11:27 PM, Harshvardhan Ojha wrote:
> Hi All,
>
> I have a question regarding retrieval of documents by lucene.
> I know lucene uses many files on disk to keep documents, each comprising
> fields in it, and uses many IR algorithms, and inverted index to match
> documents.
>
> My question is :
> 1. How lucene stores these documents inside file system and gets it so fast?
> 2. Does lucene uses any Hashing algorithm to get docs in O(1) ? If not which
> DS is used by lucene ?
> 3. Except id provided by us at the time of indexing, is there any other
> unique identifier which is assigned by lucene to its documents ?
>
> I will appreciate If someone can provide me with source file names to study
> these algorithms in detail.
>
> Regards
> Harshvardhan Ojha
