thanks a lot for the detailed info! On Wed, Apr 13, 2011 at 4:43 AM, Grant Ingersoll <gsing...@apache.org> wrote: > > On Apr 7, 2011, at 9:17 PM, Yang wrote: > >> I'm new to lucene/search engine , and have been struggling with these >> questions recently. >> I'd appreciate a lot of you could shed some light on this. >> >> >> let's say I do a query on >> >> dog greyhound >> >> note that I did not quote them, i.e. this is not a phrase search. >> >> what happens under the hood ? > > A lot and it is really beyond the scope of an email, but... > > Assuming dog and greyhound are not removed by analysis, it will go to the > inverted index and look up each of the terms and the documents for each of > the terms and score them (see BooleanQuery for how it combines them). I > would suggest setting up a really simple program that does this query and > then just step through the code. You may even want to start a little simpler > with a single term. > >> >> which term does Lucene use to look up the inverted Index ? > > Both > > >> I read somewhere that Lucene uses the term with the higher IDF (i.e. >> the more distinguishing term), i.e. in this case >> "greyhound", but what about dog? does Lucene traverse down the doclist >> of "dog" at all? > > Yes, it does. > >> if I provide multiple terms in my query, >> generally how does Lucene decide how many doclists to travel down? >> > > I don't believe we currently do any pruning, so all will be visited. > >> >> I read that Lucene uses a combination of "binary model" > > I would say "Boolean Model" > >> and VSM, then >> it seems that in the above case, it finds >> the full doclist of dog , and that of "greyhound", (the binary model >> part), then find the common docs from the two doclists, >> then order them by scores ( the VSM part). > > Yes, assuming an AND combines them. If it's an OR, then it will union the > lists. > >> is it true that the FULL >> doclists are fetched first? or is some pruning done on the individual >> doclists? > > All docs that contain the word are scored and can potentially be returned if > you have enough time/memory (not recommended) > >> I see the >> talk in http://www.slideshare.net/abial/eurocon2010 that talks about >> pruning and tiered search, but is this the default behavior of Lucene? > > That is more advanced and is something you can do w/ Lucene, but is not out > of the box. > >> how are the doclists sorted? (by idf ?? --- sorry I'm just beginning >> to sift through a lot of docs online, somehow got this impression but >> can't form a precise conclusion) >> > > It depends on the query, etc. Have a look at the Scoring page on the website > for more info. > >> >> >> also generally, could you please provide some good articles on how >> lucene/search engines work? I've read the "anatomy of a search engine" >> (google Sergey Brin & Larry Page paper), >> "introduction to information retrieval (Manning et al ) " , "Lucene >> in action" .... > > I'd start w/ Lucene in Action 2nd ed. Brin and Page paper is good. As is > the Manning book, Baeza Yates, Grossman, etc. I believe we have a resources > page on our Wiki that lists out a lot of books and talks. > > I would recommend, however, just trying out small examples of different > queries and stepping through the code if you really want to know the guts of > Lucene. > > -------------------------- > Grant Ingersoll > http://www.lucidimagination.com/ > > Search the Lucene ecosystem docs using Solr/Lucene: > http://www.lucidimagination.com/search > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
--------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org