some basic questions on how Lucene/search engines work

Yang Thu, 07 Apr 2011 21:18:02 -0700

I'm new to lucene/search engine , and have been struggling with these
questions recently.
I'd appreciate a lot of you could shed some light on this.



let's say I do a query on

dog   greyhound

note that I did not quote them, i.e. this is not a phrase search.

what happens under the hood ?

which term does Lucene use to look up the inverted Index ?
I read somewhere that Lucene uses the term with the higher IDF (i.e.
the more distinguishing term), i.e. in this case
"greyhound", but what about dog? does Lucene traverse down the doclist
of  "dog" at all? if I provide multiple terms in my query,
generally how does Lucene decide how many doclists to travel down?


I read that Lucene uses a combination of "binary model" and  VSM, then
it seems that in the above case, it finds
the full doclist of dog , and that of "greyhound", (the binary model
part), then find the common docs from the two doclists,
then order them by scores ( the VSM part).  is it true that the FULL
doclists are fetched first? or is some pruning done on the individual
doclists? I see the
talk in http://www.slideshare.net/abial/eurocon2010  that talks about
pruning and tiered search, but is this the default behavior of Lucene?
how are the doclists sorted? (by  idf ?? --- sorry I'm just beginning
to sift through a lot of docs online, somehow got this impression but
can't form a precise conclusion)



also generally, could you please provide some good articles on how
lucene/search engines work? I've read the "anatomy of a search engine"
(google Sergey Brin & Larry Page paper),
"introduction to information retrieval (Manning et al ) "  , "Lucene
in action" ....


Thanks
Yang

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

some basic questions on how Lucene/search engines work

Reply via email to