First, Lucene itself has related-search functionality. If you are already using Lucene, I suspect it will be better to leverage that. (Even if you are not, it is possible to run Lucene just for this purpose.) Others much more familiar with Lucene can comment.
I can comment on your current approach. Yes, it seems reasonable. You are effectively using just one piece of a user-based recommender, the UserNeighborhood component. That is all you need: an implementation such as NearestNUserNeighborhood, plus your similarity metric.

Scalability could be an issue, since you have a 'user' for every distinct query. Consider normalizing queries aggressively: lowercase them, strip extra whitespace, sort the query terms, keep only the first n terms, etc.

Also for this reason, consider using BooleanUserTanimotoCoefficientSimilarity and BooleanPrefUserDatModel (off the top of my head, so I am not sure I got those names right!). Because you do not have a reliable notion of strength of preference, you should ignore the preference value, and these implementations let you take advantage of that. Please use the very latest code from Subversion (Mahout 0.2 SNAPSHOT). I am working with a client using these implementations right now and have been fixing and optimizing them a lot recently.

Finally, you want to boost queries that are popular. You can use a Rescorer for this to inject any score changes you like. Perhaps you devise some function that pushes the similarity towards 1 the more often the two queries are observed.

Let us start there and follow up with questions here.

On Jul 20, 2009 9:50 AM, "Claudia Grieco" <[email protected]> wrote:

Hi guys! I'm trying to implement a "related search" feature using the Mahout libraries. The queries are used to retrieve a set of items stored in a DB. I have come up with this implementation:

- Treat queries as "Users" and items in the DB as "Items".
- For each query entered in the search engine, I store the text of the query and the first 10 items retrieved. (The user-item column contains the query id, the item id and the relevance score of the item for the query.)
- To compute searches related to the current search, I use Mahout's Tanimoto similarity to find the most similar "users", i.e. the queries which have the most result items in common with the current query.

Is there a way to improve what I have done? I'd like to increase the importance of a query according to its "popularity" (i.e. how many times the query was entered) and/or keep track of the most clicked items instead of the first 10 items, but I can't figure out how to do it. Any ideas?
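
As a rough illustration of the suggestions in the reply (query normalization, Tanimoto similarity over boolean item sets, and a popularity boost), here is a minimal plain-Java sketch. It deliberately does not use Mahout's classes. The class name RelatedQuerySketch, the method names, the weight parameter and the boost formula are all assumptions made for illustration; they are not part of Mahout or of the implementation described in the original message. How the observation counts of the two queries are combined into one number is also left open.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper class; none of these names come from Mahout.
final class RelatedQuerySketch {

    // Lowercase, strip extra whitespace, sort the terms, keep the first maxTerms.
    static String normalize(String query, int maxTerms) {
        String cleaned = query.trim().toLowerCase(Locale.ROOT);
        if (cleaned.isEmpty()) {
            return "";
        }
        return Arrays.stream(cleaned.split("\\s+"))
                .sorted()
                .limit(maxTerms)
                .collect(Collectors.joining(" "));
    }

    // Tanimoto coefficient on item-id sets: |A and B| / |A or B|.
    // Preference values are ignored; only membership counts.
    static double tanimoto(Set<Long> a, Set<Long> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    // Assumed boost: raise the similarity to an exponent that shrinks as the
    // observation count grows. With 0 observations the similarity is unchanged;
    // with many it moves towards 1. A similarity of exactly 0 stays 0.
    static double boost(double similarity, long observations, double weight) {
        long count = Math.max(0L, observations);
        double exponent = 1.0 / (1.0 + weight * Math.log1p(count));
        return Math.pow(similarity, exponent);
    }

    public static void main(String[] args) {
        Set<Long> current = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        Set<Long> candidate = new HashSet<>(Arrays.asList(2L, 3L, 4L));
        double similarity = tanimoto(current, candidate);
        System.out.println(normalize("  New   York Hotels ", 3));
        System.out.println(similarity);
        System.out.println(boost(similarity, 50, 1.0));
    }
}
```

In a real setup, the boost step is the kind of logic that would presumably live inside a Rescorer, while the normalization would run before a query is stored as a 'user', so that variants of the same query collapse into one entry.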
