Hi Mike/Mikhail,

Don't you think the contains(BytesRef value) method in org.apache.lucene.codecs.bloom.FuzzySet returns the probability of a field being present, and that this is a place where hashing is used?
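
To make the question concrete, here is a minimal sketch of the kind of hashed membership test being asked about. It is not Lucene's FuzzySet code: the class name TinyBloom, the bit-array size, and the seeded FNV-1a-style hash are all assumptions made for illustration. As far as I understand it, a filter of this kind does not return a numeric probability; it can only answer "definitely absent" or "possibly present".

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical illustration only: NOT Lucene's FuzzySet implementation.
final class TinyBloom {
    private final BitSet bits;
    private final int size;

    TinyBloom(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // FNV-1a style hash over the UTF-8 bytes; the seed yields a second, different hash.
    private int hash(String id, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : id.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 16777619;
        }
        return Math.floorMod(h, size); // map into [0, size)
    }

    // Record an id by setting two bit positions derived from its hashes.
    void add(String id) {
        bits.set(hash(id, 0));
        bits.set(hash(id, 0x9E3779B9));
    }

    // false => definitely absent; true => possibly present (false positives can occur).
    boolean mightContain(String id) {
        return bits.get(hash(id, 0)) && bits.get(hash(id, 0x9E3779B9));
    }

    public static void main(String[] args) {
        TinyBloom filter = new TinyBloom(1 << 16);
        filter.add("doc-42");
        System.out.println(filter.mightContain("doc-42")); // true for any id that was added
        System.out.println(filter.mightContain("doc-43")); // usually false, may be a false positive
    }
}
```

The lookup cost does not depend on how many ids were added, but a "possibly present" answer still has to be confirmed against the real index, so a filter like this can only rule documents out cheaply.
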
Are there any other places in the source which, given a document id, can compute its hash and tell whether a document with that id is present in a single O(1) lookup?

Regards
Harshvardhan Ojha

On Thu, Feb 13, 2014 at 5:11 PM, Michael McCandless <[email protected]> wrote:
> Lucene only assigns its int docID during indexing.
>
> Retrieving a previously stored document is O(1), but that involves a
> disk seek, which can be very costly when the page is not in the OS's IO
> cache. Lucene does not do any caching itself (it relies on the OS
> instead).
>
> Have a look at the current default stored fields codec format:
>
> lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41StoredFieldsFormat
> for details.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Feb 12, 2014 at 11:27 PM, Harshvardhan Ojha
> <[email protected]> wrote:
> > Hi All,
> >
> > I have a question regarding retrieval of documents by Lucene.
> > I know Lucene uses many files on disk to keep documents, each comprising
> > fields, and uses many IR algorithms and an inverted index to match
> > documents.
> >
> > My questions are:
> > 1. How does Lucene store these documents in the file system and retrieve
> > them so fast?
> > 2. Does Lucene use any hashing algorithm to get docs in O(1)? If not,
> > which data structure does Lucene use?
> > 3. Apart from the id we provide at indexing time, is there any other
> > unique identifier that Lucene assigns to its documents?
> >
> > I would appreciate it if someone could point me to the source files to
> > study these algorithms in detail.
> >
> > Regards
> > Harshvardhan Ojha
