Just to throw in a few things:
First off, this is great!
As I am sure you are aware: https://issues.apache.org/jira/browse/LUCENE-836
On Jun 25, 2007, at 3:15 AM, Doron Cohen wrote:
Hi, this could probably be split into two threads, but for context let's start it as a single discussion. Recently I was looking at the search quality of Lucene - recall and precision, focused on P@1,5,10,20 and, mainly, MAP.
-- Part 1 --
I found that quality can be enhanced by modifying the doc length normalization, and by changing the tf() computation to also consider the average tf() in a single document.

For the first change, the logic is that Lucene's default length normalization punishes long documents too much. I found contrib's sweet-spot-similarity helpful here, but not enough. A better doc-length normalization method is one that considers collection statistics - e.g. the average doc length. The nice problem with such an approach is that you don't know the average length at indexing time, and it changes as the index evolves. The static nature of norms computation (and its API) in Lucene is, while efficient, an obstacle for global computations. Another issue here is that applications often split documents into fields for reasons that are not "pure IR" - for instance, a content field and a title field, just to be able to boost the title by (say) 3 - but in fact there is no "IR'ish" difference between finding the searched text in the title field or in the body field; they really serve/answer the same information need. For that matter, I believe that using a single document length when searching all these fields is more "accurate".
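To make this concrete, here is a minimal sketch of what such a collection-aware normalization could look like - it assumes the average field length can be estimated and supplied from outside, which is exactly the part the current static norms API makes awkward (the class and parameter names are just illustrative):

import org.apache.lucene.search.DefaultSimilarity;

/**
 * Sketch only: a pivoted doc-length normalization that needs the
 * collection's average field length - a statistic the static
 * lengthNorm() API cannot obtain on its own at indexing time.
 */
public class PivotedLengthNormSimilarity extends DefaultSimilarity {

  private final float avgFieldLength; // estimated externally, e.g. from a prior pass over the index
  private final float slope;          // e.g. 0.25f; smaller values penalize long docs less

  public PivotedLengthNormSimilarity(float avgFieldLength, float slope) {
    this.avgFieldLength = avgFieldLength;
    this.slope = slope;
  }

  // Long documents are punished less harshly than the default 1/sqrt(numTokens).
  public float lengthNorm(String fieldName, int numTokens) {
    return (float) (1.0 / Math.sqrt((1.0f - slope) * avgFieldLength
                                    + slope * numTokens));
  }
}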
Further complicated by apps that duplicate fields for things like case-sensitive search, etc. This is where having more field semantics would be useful, ala Solr or some other mechanism.

Also, are you making these judgements based on TREC?
For the logic of the second change - assume two documents: doc1 containing 10 "A"'s, 10 "B"'s, and 10 "Z"'s, and doc2 containing "A" to "T" and 10 "Z"'s. Both doc1 and doc2 are of length 30. Searching for "Z", in both doc1 and doc2 tf("Z")=10. So, currently, doc1 and doc2 score the same for "Z", but the "truth" is that "Z" is much more representative/important in doc2 than it is in doc1, because its frequency in doc2 is 10 times that of all the other words in that doc, while in doc1 it is the same as the other words in that doc. If you agree about the potential improvement here, again, a nice problem is that the current Similarity API does not even allow considering this info (the average term frequency in the specific document), because Similarity.tf(int/float freq) takes only the frequency param. One way to open the way for such a computation is to add an "int docid" param to the Similarity class, but then the implementation of that class becomes IndexReader aware.
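Purely as an illustration of the kind of computation meant here (the extra parameter below does not exist in the current Similarity API - making room for it is the whole point):

// Hypothetical signature - the current API is tf(float freq) only.
// avgTermFreqInDoc = (doc length) / (number of distinct terms in the doc).
// For doc1 above avgTermFreqInDoc = 10, for doc2 it is 30/21 (about 1.43),
// so for the same raw freq of 10, doc2 scores roughly 3x higher than doc1.
public static float tf(float freq, float avgTermFreqInDoc) {
  return (float) (Math.sqrt(freq) * Math.log(1.0 + freq / avgTermFreqInDoc));
}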
Both modifications above have, in addition to API implications, also performance implications (mainly search performance), and I would like to get some feedback on what people think about going in this direction... first the "if", only then the "how"...
Perhaps revisiting Flexible Indexing is the way to go. The trick will be in how to write an API that supports the current way, but also allows us to add new methods for these kinds of things.
-- Part 2 --
It is very important that we be able to assess the search quality in a repeatable manner - so that anyone can repeat the quality tests, and maybe find ways to improve them. (This would also allow verifying the "improvement claims" above...) This capability seems like a natural part of the benchmark package. I started to look at extending the benchmark package with a search quality module that would open an index (or first create one), run a set of queries (similar to the performance benchmark), and compute and report the set of known statistics mentioned above, and more.

Such a module depends on input data - documents, queries, and judgements. And that's my second question. We don't have to invent this data - TREC has it already, and it is getting wider every year as there are more judgements. So, theoretically, we could use TREC data. One problem here is that TREC data must be purchased. Not sure that this is a problem - it is OK if we provide the mechanism to use this data for those who have it (universities, for one). The other problem is that it is not clear to me what one can legally say about a certain system's results on TREC data. I would like the Search Quality web page of Lucene to say something like: "MAP of XYZ for Track Z of TREC 2004", and then a certain submitted patch to say "I improved to 1.09*XYZ". But would that be legal? I just re-read their "Agreement Concerning Dissemination of TREC Results" - http://trec.nist.gov/act_part/forms/noads.html - and I am not feeling smarter about this.
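For reference, MAP is just the mean over all judged queries of the per-query average precision; a rough sketch of the per-query computation (not existing benchmark code, names are illustrative) could look like:

import java.util.List;
import java.util.Set;

public class SearchQualitySketch {

  // Sketch only: average precision for one query's ranked result list,
  // given the doc ids judged relevant for that query. MAP is then the
  // mean of this value over all judged queries.
  public static double averagePrecision(List<String> rankedDocIds,
                                        Set<String> relevantDocIds) {
    if (relevantDocIds.isEmpty()) {
      return 0.0;
    }
    int relevantSeen = 0;
    double precisionSum = 0.0;
    for (int rank = 1; rank <= rankedDocIds.size(); rank++) {
      if (relevantDocIds.contains(rankedDocIds.get(rank - 1))) {
        relevantSeen++;
        precisionSum += (double) relevantSeen / rank; // precision at this rank
      }
    }
    return precisionSum / relevantDocIds.size();
  }
}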
IANAL and I didn't read the link, but I think people publish their MAP scores, etc. all the time on TREC data. I think it implies that you obtained the data through legal means.
I agree about providing the mechanism to work with TREC. I also have had a couple of other thoughts/opinions/alternatives (my own, personal opinion):

1. Create our own judgements on Wikipedia or the Reuters collection. This is no doubt hard and would require a fair number of volunteers, and it could/would compete at some level with TREC. One advantage is that the whole process would be open, whereas the TREC process is not. It would be slow to develop, too, but could be highly useful to the whole IR community. Perhaps we could make a case at SIGIR, or something like that, for the need for a truly open process. Perhaps we could post on the SIGIR list or something to gauge interest. I don't really know if that is the proper place or not; I have just recently subscribed to the mailing list, so I don't have a feel for the postings on that list. Perhaps a new project? Lucene Relevance, OpenTREC, FreeTREC? Seriously, Nutch could use relevance judgments for the "web track", Solr could use them for several tracks, and Lucene J. as well. And I am sure there are a lot of other OS search engines that would benefit.
2. Petition NIST to make TREC data available to open source search projects. Perhaps someone acting in an official capacity for the ASF could submit a letter (I am willing to do so, I guess, given help drafting it) after it goes through legal, etc. I'm thinking of something similar to what has been going on with the Open Letter to Sun concerning the Java implementation. Perhaps simply asking would be enough to start a dialog on how it could be done. We may have to come up w/ safeguards on downloads or something, I don't know. I would bet the real issue with the data is that it is copyrighted and we are paying to license it. Perhaps we should start lobbying TREC to use non-copyrighted information. Maybe if we got enough open source search libraries interested we could make some noise! Maybe we could all go protest outside of the TREC conference! Ha, ha, ha! We would need a catchy chant, though. And if anyone thinks I am serious about this last part, I am not.
Cheers,
Grant