Doug Cutting wrote:
Chuck Williams wrote:
Christoph Goller writes:
> You may be right. But I am not completely convinced. I think
> this should be decided based on the proposed benchmark evaluation.
Is that still happening?
Like anything else in an all-volunteer operation, it will only happen if folks volunteer to do it. Someone needs to take the lead and index a reference collection with a couple of different Similarity implementations and post the code and the results of various searches for folks to evaluate. Chuck?
In theory I can probably easily do this, esp if someone would submit another Similarity implementation.
The corpus easiest for me to use is the subset of the English Wikipedia I've been playing with. It has 400k documents. Let's see: the max length of a body is 258 KB, the avg length of non-trivial entries (size > 100 chars) is 2450 chars, and the std dev is 3400 chars. I'm using "wikipedia namespace 0", which means the normal encyclopedia pages and not things like chat logs, help pages, or whatnot.
I recently made a demo page of the MoreLikeThis similarity query generator plus related algorithms. (Confusion alert: 'similarity' here means "show me documents similar to another doc". It is implemented on top of Lucene and is not the same as org.apache.lucene.search.Similarity.)
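For anyone who wants to report comparable figures for their own corpus, here is a minimal sketch of how such length statistics could be computed. It is an assumption about the method, not the script used for the numbers above: the function name `length_stats`, the `min_chars` parameter, and the choice of population (rather than sample) standard deviation are all hypothetical.

```python
from statistics import mean, pstdev


def length_stats(bodies, min_chars=100):
    """Summarize document body lengths, in characters.

    Entries with len(body) <= min_chars are treated as trivial and are
    excluded from the average and standard deviation, but they still
    count toward the document total and the maximum length.
    """
    lengths = [len(body) for body in bodies]
    nontrivial = [n for n in lengths if n > min_chars]
    return {
        "docs": len(lengths),
        "max_len": max(lengths, default=0),
        "avg_nontrivial": mean(nontrivial) if nontrivial else 0.0,
        # Population std dev; a sample std dev (statistics.stdev) would
        # differ only slightly on a corpus of this size.
        "std_nontrivial": pstdev(nontrivial) if nontrivial else 0.0,
    }


if __name__ == "__main__":
    sample = ["stub", "x" * 200, "y" * 400]
    print(length_stats(sample))
```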
The page runs 3 algorithms in parallel and displays them on 1 page.
Here's a page that shows the 3 cols, 1 per alg:
http://www.searchmorph.com/kat/wikipedia-compare.jsp?s=Information_retrieval
You get there from a normal Wikipedia search: click on "cmp" to the right of one of the matching docs:
http://www.searchmorph.com/kat/wikipedia.jsp?s=information+retrieval
Oh, and the relevance to this thread: I'm assuming this is what we want for comparing the different Similarity implementations, an easy way of seeing how they perform against a given query.
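To make the idea concrete, here is a rough sketch of such a side-by-side comparison. It is a toy in plain Python, not Lucene code: `TinyIndex`, `raw_tf`, `tf_idf_length_norm`, `compare`, and `format_columns` are hypothetical names, and the two scoring functions are illustrative stand-ins rather than any actual Similarity implementation proposed in this thread.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class TinyIndex:
    """In-memory stand-in for an index: term frequencies per document."""

    def __init__(self, docs):
        self.names = list(docs)
        self.tfs = {name: Counter(tokenize(body)) for name, body in docs.items()}
        self.lengths = {name: sum(tf.values()) for name, tf in self.tfs.items()}
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for tf in self.tfs.values() for term in tf)

    def idf(self, term):
        return 1.0 + math.log(len(self.names) / (self.df[term] + 1))


def raw_tf(index, doc, terms):
    """Score = total count of query terms; no length normalization."""
    return float(sum(index.tfs[doc][t] for t in terms))


def tf_idf_length_norm(index, doc, terms):
    """Score = sum(sqrt(tf) * idf^2) scaled by 1/sqrt(document length)."""
    length = index.lengths[doc]
    norm = 1.0 / math.sqrt(length) if length else 0.0
    return norm * sum(math.sqrt(index.tfs[doc][t]) * index.idf(t) ** 2 for t in terms)


def rank(index, scorer, query, top=3):
    terms = tokenize(query)
    scored = [(scorer(index, doc, terms), doc) for doc in index.names]
    scored = [(s, d) for s, d in scored if s > 0]
    # Highest score first; ties broken by document name.
    return [d for s, d in sorted(scored, key=lambda p: (-p[0], p[1]))[:top]]


def compare(index, scorers, query, top=3):
    """Run every scorer on the same query; one ranked list per scorer."""
    return {name: rank(index, fn, query, top) for name, fn in scorers.items()}


def format_columns(results, width=24):
    """Render the ranked lists as one text column per scorer."""
    names = list(results)
    rows = max((len(r) for r in results.values()), default=0)
    lines = ["".join(n.ljust(width) for n in names).rstrip()]
    for i in range(rows):
        cells = [results[n][i] if i < len(results[n]) else "" for n in names]
        lines.append("".join(c.ljust(width) for c in cells).rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    docs = {
        "long": "retrieval retrieval retrieval " + "filler " * 20,
        "short": "information retrieval",
    }
    scorers = {"raw_tf": raw_tf, "tf_idf_norm": tf_idf_length_norm}
    print(format_columns(compare(TinyIndex(docs), scorers, "retrieval")))
```

The point of the column layout is that differences in ranking, such as how much a long document is penalized, show up at a glance for a single query.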
So, moving forward, if anyone agrees in general with me:
[1] Post some reasonable/interesting Similarity implementations
[2] Confirm that it makes sense to compare them on 1 screen "in parallel"
thx, Dave
Doug
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]