Re: Algorithm of retrieving docs

Harshvardhan Ojha Thu, 13 Feb 2014 20:30:47 -0800

Thanks for the help, Michael; this solved my problem.

Regards
Harshvardhan Ojha



On Thu, Feb 13, 2014 at 8:51 PM, Michael McCandless <[email protected]> wrote:

> The bloom filter is only used by the postings format wrapper, and
> we've had mixed results on whether it helps performance or not (seems
> to depend heavily on the exact usage).
>
> We have bit set / iterator abstractions (oal.util.Bits,
> oal.search.DocIdSet/Iterator) to manage "sets" of documents, but most
> implementations don't use a hash set under the hood.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Feb 13, 2014 at 7:11 AM, Harshvardhan Ojha
> <[email protected]> wrote:
> > Hi Mike/Mikhail,
> >
> > Don't you think the contains(BytesRef value) method in
> > org.apache.lucene.codecs.bloom.FuzzySet returns the probability of
> > having a field, and that it is a place where we are using hashing?
> >
> > Is there any other place in the source which, given a document id, could
> > calculate its hash and determine whether a document with this id is
> > present in a single O(1) lookup?
> >
> > Regards
> > Harshvardhan Ojha
> >
> >
> > On Thu, Feb 13, 2014 at 5:11 PM, Michael McCandless
> > <[email protected]> wrote:
> >>
> >> Lucene only assigns its int docID during indexing.
> >>
> >> Retrieving a previously stored document is O(1), but that involves a
> >> disk seek which can be very costly when the page is not in the OS's IO
> >> cache.  Lucene does not do any caching itself (relies on the OS
> >> instead).
> >>
> >> Have a look at the current default stored fields codec format,
> >> lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41StoredFieldsFormat,
> >> for details.
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Wed, Feb 12, 2014 at 11:27 PM, Harshvardhan Ojha
> >> <[email protected]> wrote:
> >> > Hi All,
> >> >
> >> > I have a question regarding retrieval of documents by Lucene.
> >> > I know Lucene uses many files on disk to keep documents, each comprising
> >> > fields, and uses many IR algorithms and an inverted index to match
> >> > documents.
> >> >
> >> > My questions are:
> >> > 1. How does Lucene store these documents in the file system and retrieve
> >> >    them so fast?
> >> > 2. Does Lucene use any hashing algorithm to get docs in O(1)? If not,
> >> >    which data structure does Lucene use?
> >> > 3. Apart from the id we provide at indexing time, is there any other
> >> >    unique identifier that Lucene assigns to its documents?
> >> >
> >> > I would appreciate it if someone could point me to the source files to
> >> > study these algorithms in detail.
> >> >
> >> > Regards
> >> > Harshvardhan Ojha
> >> >
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >
>
>
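
The distinction drawn in the thread between the Bloom filter (a probabilistic, hash-based check) and the bit set abstractions (an exact check with no hashing) can be sketched in plain Java. This is only an illustration under stated assumptions, not Lucene's code: java.util.BitSet stands in for oal.util.Bits / DocIdSet, the names DocSetSketch, ToyBloomFilter and docSetOf are made up for this example, and the two hash functions are arbitrary choices, not the ones Lucene uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

final class DocSetSketch {

    /** Builds an exact set of int docIDs; membership is a direct bit lookup, no hashing. */
    static BitSet docSetOf(int... docIds) {
        BitSet docs = new BitSet();
        for (int id : docIds) {
            docs.set(id);
        }
        return docs;
    }

    /** Toy Bloom filter (hypothetical): false positives are possible, false negatives are not. */
    static final class ToyBloomFilter {
        private final BitSet bits;
        private final int size;

        ToyBloomFilter(int size) {
            this.size = size;
            this.bits = new BitSet(size);
        }

        // Two simple hash functions over the term bytes, for illustration only.
        private int hash1(byte[] term) {
            int h = 17;
            for (byte b : term) h = 31 * h + b;
            return Math.floorMod(h, size);
        }

        private int hash2(byte[] term) {
            int h = 7;
            for (byte b : term) h = 131 * h + (b ^ 0x5f);
            return Math.floorMod(h, size);
        }

        void add(String term) {
            byte[] t = term.getBytes(StandardCharsets.UTF_8);
            bits.set(hash1(t));
            bits.set(hash2(t));
        }

        /** false means "definitely absent"; true only means "possibly present". */
        boolean mightContain(String term) {
            byte[] t = term.getBytes(StandardCharsets.UTF_8);
            return bits.get(hash1(t)) && bits.get(hash2(t));
        }
    }

    public static void main(String[] args) {
        // Exact docID membership: one array access per lookup.
        BitSet docs = docSetOf(3, 7, 42);
        System.out.println("doc 7 present? " + docs.get(7));
        System.out.println("doc 8 present? " + docs.get(8));

        // Iterating matching docIDs in increasing order, similar in spirit to a doc id iterator.
        for (int doc = docs.nextSetBit(0); doc >= 0; doc = docs.nextSetBit(doc + 1)) {
            System.out.println("doc " + doc);
        }

        // Probabilistic term check: a "true" still has to be confirmed elsewhere.
        ToyBloomFilter terms = new ToyBloomFilter(1024);
        terms.add("lucene");
        System.out.println("might contain 'lucene'? " + terms.mightContain("lucene"));
    }
}
```

The bit set answers "is docID n in this set?" exactly and in constant time because docIDs are small dense ints, which is one reason a hash set is not needed for that job. The Bloom filter only helps skip work when it answers "definitely absent".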
