Re: Algorithm of retrieving docs

Michael McCandless Thu, 13 Feb 2014 03:42:11 -0800

Lucene only assigns its int docID during indexing.

Retrieving a previously stored document is O(1), but it involves a
disk seek, which can be very costly when the page is not in the OS's IO
cache.  Lucene does not do any caching itself (it relies on the OS
instead).
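
To make the O(1) point concrete, here is a rough, self-contained sketch
of the general idea. It is not Lucene's actual stored fields format
(which is more involved); the class name ToyStoredFields and its methods
are made up for illustration. It assumes docIDs are handed out
sequentially as documents are added, and that a docID-indexed table of
file offsets is enough to locate a document with a single seek.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy store, NOT Lucene code: docID -> file offset -> bytes. */
public class ToyStoredFields implements AutoCloseable {
    private final RandomAccessFile data;
    // Position i holds the file offset of the document whose docID is i.
    private final List<Long> offsets = new ArrayList<>();

    public ToyStoredFields(Path file) throws IOException {
        this.data = new RandomAccessFile(file.toFile(), "rw");
    }

    /** Appends a document; the docID is simply the next sequential int. */
    public int add(String content) throws IOException {
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        long start = data.length();
        data.seek(start);
        data.writeInt(bytes.length); // length prefix
        data.write(bytes);
        offsets.add(start);
        return offsets.size() - 1;
    }

    /** O(1) lookup: one table read, one seek, one length-prefixed read. */
    public String document(int docID) throws IOException {
        data.seek(offsets.get(docID)); // the potentially costly disk seek
        byte[] bytes = new byte[data.readInt()];
        data.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    @Override
    public void close() throws IOException {
        data.close();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("toy-stored-fields", ".bin");
        try (ToyStoredFields store = new ToyStoredFields(file)) {
            store.add("title: hello");
            int second = store.add("title: world");
            // prints: 1 -> title: world
            System.out.println(second + " -> " + store.document(second));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

No hashing is needed in this sketch because the docID is a dense int
that doubles as an array index. Like Lucene, the toy does no caching of
its own, so whether document() is fast depends on whether the OS already
has the page in its IO cache.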


Have a look at the current default stored fields codec format:
lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41StoredFieldsFormat
for details.

Mike McCandless

http://blog.mikemccandless.com


On Wed, Feb 12, 2014 at 11:27 PM, Harshvardhan Ojha
<[email protected]> wrote:
> Hi All,
>
> I have a question regarding retrieval of documents by lucene.
> I know lucene uses many files on disk to keep documents, each comprising
> fields in it, and uses many IR algorithms, and inverted index to match
> documents.
>
> My question is :
> 1. How lucene stores these documents inside file system and gets it so fast?
> 2. Does lucene uses any Hashing algorithm to get docs in O(1) ? If not which
> DS is used by lucene ?
> 3. Except id provided by us at the time of indexing, is there any other
> unique identifier which is assigned by lucene to its documents ?
>
> I will appreciate If someone can provide me with source file names to study
> these algorithms in detail.
>
> Regards
> Harshvardhan Ojha
>
>

