First, Lucene itself has related-search functionality. If you are already
using it, I suspect it will be better to leverage that. (Even if you are not,
it is possible to run Lucene just for this purpose.) Others much more
familiar with Lucene can comment.

I can comment on your current approach. Yes, it seems reasonable. You are
effectively using just one piece of a user-based recommender: the
UserNeighborhood component. All you need is an implementation of it, such as
NearestNUserNeighborhood, plus your similarity metric.

Scalability could be an issue, since you have a 'user' for every distinct
query. Consider normalizing queries aggressively: lowercase them, strip extra
whitespace, sort the query terms, keep only the first n terms, etc.
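
To make the normalization idea concrete, here is a minimal sketch in plain
Java. QueryNormalizer and its maxTerms parameter are names I made up for
illustration; they are not part of Mahout. It assumes terms are separated by
whitespace and that the "first n terms" are taken after sorting, which is
only one possible reading.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.stream.Collectors;

/** Hypothetical helper (not a Mahout class): maps query variants to one key. */
final class QueryNormalizer {

    private QueryNormalizer() {}

    /**
     * Lowercases the query, trims it and splits it on runs of whitespace,
     * sorts the terms, and keeps at most maxTerms of them.
     */
    static String normalize(String rawQuery, int maxTerms) {
        return Arrays.stream(rawQuery.toLowerCase(Locale.ROOT).trim().split("\\s+"))
                .filter(term -> !term.isEmpty()) // an empty query yields no terms
                .sorted()
                .limit(maxTerms)
                .collect(Collectors.joining(" "));
    }
}
```

With this, "  Cheap   Flights Rome " and "rome cheap FLIGHTS" collapse to the
same key, so they would share one 'user' ID instead of two.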

Also for this reason, consider using
BooleanUserTanimotoCoefficientSimilarity and BooleanPrefUserDataModel (off
the top of my head; I am not sure I got those names right!). Because you
do not have a reliable notion of strength of preference, you should ignore
the preference value, and these implementations let you take advantage of this.
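
For reference, this is what a Tanimoto coefficient over boolean data boils
down to, sketched in plain Java rather than with the Mahout classes (whose
exact names I would not rely on here). BooleanTanimoto is a hypothetical
name, and each query is assumed to be represented as the set of item IDs it
retrieved.

```java
import java.util.Set;

/** Hypothetical illustration (not a Mahout class) of boolean Tanimoto. */
final class BooleanTanimoto {

    private BooleanTanimoto() {}

    /**
     * Returns |A intersect B| / |A union B| for two sets of item IDs.
     * Preference values are ignored; only membership counts.
     */
    static double similarity(Set<Long> a, Set<Long> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // assumption: two empty queries count as unrelated
        }
        // Iterate over the smaller set to keep the lookup count low.
        Set<Long> smaller = a.size() <= b.size() ? a : b;
        Set<Long> larger = (smaller == a) ? b : a;
        int intersection = 0;
        for (Long itemId : smaller) {
            if (larger.contains(itemId)) {
                intersection++;
            }
        }
        int union = a.size() + b.size() - intersection;
        return (double) intersection / union;
    }
}
```

Two queries that returned items {1, 2, 3} and {2, 3, 4} share 2 of 4 distinct
items, giving a similarity of 0.5.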

Please use the very latest code from Subversion (Mahout 0.2 SNAPSHOT). I am
right now working with a client using these implementations and have been
fixing and optimizing them a lot recently.

Finally, you want to boost queries that are popular. You can use a Rescorer
for this to inject any score changes you like. Perhaps you devise some
function that pushes the similarity towards 1 the more often the two queries
are observed.
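
As one possible shape for such a function, here is a sketch in plain Java.
PopularityBoost, boost and timesObserved are hypothetical names, and the
logarithmic damping is just my pick, not anything Mahout prescribes. The
idea is to shrink the gap between the similarity and 1 as the observation
count grows; logic like this could be called from inside a Rescorer.

```java
/** Hypothetical illustration (not a Mahout class) of a popularity boost. */
final class PopularityBoost {

    private PopularityBoost() {}

    /**
     * Moves a similarity in [0, 1] towards 1 as timesObserved grows.
     * With timesObserved == 0 the similarity is returned unchanged.
     */
    static double boost(double similarity, long timesObserved) {
        if (similarity < 0.0 || similarity > 1.0) {
            throw new IllegalArgumentException("similarity must be in [0, 1]");
        }
        if (timesObserved < 0) {
            throw new IllegalArgumentException("timesObserved must be >= 0");
        }
        // The damping factor is 1 at zero observations and grows slowly.
        double damping = 1.0 + Math.log1p(timesObserved);
        return 1.0 - (1.0 - similarity) / damping;
    }
}
```

The logarithm keeps a very popular query from swamping everything else; a
steeper or flatter curve may suit your data better.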

Let us start there and follow up with questions here.

On Jul 20, 2009 9:50 AM, "Claudia Grieco" <[email protected]> wrote:

Hi guys!

I'm trying to implement a "related search" feature using the mahout
libraries. The queries are used to retrieve a set of items memorized in a
DB.

I have come up with this implementation:

-treat queries as "Users" and items in the DB as "Items"

-for each query entered in the search engine I memorize the text of the
query and the first 10 items retrieved. (the user-item column contains the
query id, the item id and the relevance score of the item for the query)

-to compute searches related to the current search I use mahout's Tanimoto
similarity to find the most similar "users", i.e. the queries which have
more result items in common with the current query.



Is there a way to improve what I have done? I'd like to increase the
importance of a query according to its "popularity" (i.e. how many times the
query was entered) and/or keep track of the most clicked items instead of
the first 10 items, but I can't figure out how to do it.

Any ideas?
