Re: some basic questions on how Lucene/search engines work

Yang Wed, 13 Apr 2011 08:52:59 -0700

thanks a lot for the detailed info!

On Wed, Apr 13, 2011 at 4:43 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>
> On Apr 7, 2011, at 9:17 PM, Yang wrote:
>
>> I'm new to lucene/search engine , and have been struggling with these
>> questions recently.
>> I'd appreciate a lot of you could shed some light on this.
>>
>>
>> let's say I do a query on
>>
>> dog   greyhound
>>
>> note that I did not quote them, i.e. this is not a phrase search.
>>
>> what happens under the hood ?
>
> A lot and it is really beyond the scope of an email, but...
>
> Assuming dog and greyhound are not removed by analysis, it will go to the 
> inverted index and look up each of the terms and the documents for each of 
> the terms and score them (see BooleanQuery for how it combines them).  I 
> would suggest setting up a really simple program that does this query and 
> then just step through the code.  You may even want to start a little simpler 
> with a single term.
>
>>
>> which term does Lucene use to look up the inverted Index ?
>
> Both
>
>
>> I read somewhere that Lucene uses the term with the higher IDF (i.e.
>> the more distinguishing term), i.e. in this case
>> "greyhound", but what about dog? does Lucene traverse down the doclist
>> of  "dog" at all?
>
> Yes, it does.
>
>> if I provide multiple terms in my query,
>> generally how does Lucene decide how many doclists to travel down?
>>
>
> I don't believe we currently do any pruning, so all will be visited.
>
>>
>> I read that Lucene uses a combination of "binary model"
>
> I would say "Boolean Model"
>
>> and  VSM, then
>> it seems that in the above case, it finds
>> the full doclist of dog , and that of "greyhound", (the binary model
>> part), then find the common docs from the two doclists,
>> then order them by scores ( the VSM part).
>
> Yes, assuming an AND combines them.  If it's an OR, then it will union the 
> lists.
>
>>  is it true that the FULL
>> doclists are fetched first? or is some pruning done on the individual
>> doclists?
>
> All docs that contain the word are scored and can potentially be returned if 
> you have enough time/memory (not recommended)
>
>> I see the
>> talk in http://www.slideshare.net/abial/eurocon2010  that talks about
>> pruning and tiered search, but is this the default behavior of Lucene?
>
> That is more advanced and is something you can do w/ Lucene, but is not out 
> of the box.
>
>> how are the doclists sorted? (by  idf ?? --- sorry I'm just beginning
>> to sift through a lot of docs online, somehow got this impression but
>> can't form a precise conclusion)
>>
>
> It depends on the query, etc.  Have a look at the Scoring page on the website 
> for more info.
>
>>
>>
>> also generally, could you please provide some good articles on how
>> lucene/search engines work? I've read the "anatomy of a search engine"
>> (google Sergey Brin & Larry Page paper),
>> "introduction to information retrieval (Manning et al ) "  , "Lucene
>> in action" ....
>
> I'd start w/ Lucene in Action 2nd ed.  Brin and Page paper is good.  As is 
> the Manning book, Baeza Yates, Grossman, etc.  I believe we have a resources 
> page on our Wiki that lists out a lot of books and talks.
>
> I would recommend, however, just trying out small examples of different 
> queries and stepping through the code if you really want to know the guts of 
> Lucene.
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem docs using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: some basic questions on how Lucene/search engines work

Reply via email to