Erik Hatcher <[EMAIL PROTECTED]> writes:
By all means, if you have other suggestions for our site, let us know at [EMAIL PROTECTED]
One of the things I would like to see, but which isn't in the Lucene site, the documentation, or "Lucene in Action", is a complete description of how the retrieval algorithm works. That is, how the HitCollector, Scorers, Similarity, etc. all fit together.
I'm involved in a project which to some degree is looking at poking deeply into this part of the Lucene code. We have a nice (non-Lucene) framework for working with a wider variety of similarity functions (beyond tf-idf), which should also be extensible to cover query expansion, relevance feedback, and the like.
I used to think that integrating it would be as simple as hacking in Similarity, but I'm beginning to think it might need broader changes. I could obviously hook in our whole retrieval setup by just diving for an IndexReader and doing it all by hand, but then I would have to redo the incremental search and possibly the rich query structure, which would be a lose.
So anyway, I got LIA hoping for a good explanation (not a good Explanation) on this bit, but it wasn't there.
Hacking Similarity wasn't covered in LIA for one simple reason: Lucene's built-in scoring mechanism really is good enough for almost all projects. The book was written for developers of those projects.
Personally, I've not had to hack Similarity, though I've toyed with it in prototypes and am using a minor tweak (turning off length normalization for the "title" field) for the lucenebook.com book indexing.
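A minimal sketch of what that kind of tweak amounts to, written as plain Java so it runs without Lucene on the classpath. The class name FieldAwareLengthNorm is hypothetical and is not part of Lucene; in Lucene itself the equivalent change would, as far as I know, be made by overriding lengthNorm(String fieldName, int numTokens) in a DefaultSimilarity subclass and installing that Similarity at both index and search time. The 1/sqrt(numTokens) default below is assumed to mirror the stock behavior.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical stand-in for a Similarity-style length norm.
 * Not a Lucene class; it only illustrates the per-field tweak.
 */
class FieldAwareLengthNorm {
    // Fields whose length should not affect the score (e.g. "title").
    private final Set<String> unnormalizedFields;

    FieldAwareLengthNorm(String... unnormalizedFields) {
        this.unnormalizedFields = new HashSet<>(Arrays.asList(unnormalizedFields));
    }

    /**
     * Returns the length-normalization factor for a field.
     * Listed fields get a constant 1.0, so short and long values score alike;
     * all other fields get 1/sqrt(numTokens), which favors shorter fields.
     */
    float lengthNorm(String fieldName, int numTokens) {
        if (unnormalizedFields.contains(fieldName)) {
            return 1.0f;
        }
        return (float) (1.0 / Math.sqrt(numTokens));
    }
}
```

With new FieldAwareLengthNorm("title"), a 3-token title and a 30-token title get the same factor, while a body field is still penalized for length.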
There are some hints on the Lucene site, but nothing complete. If I muddle it out before anything gets contributed, I'll try to write something up, but don't expect anything too soon...
And maybe you'd contribute what you write to LIA 2nd edition :)
Erik
