Hey Doron,
I see you recommend that we think about making SweetSpot the default similarity. Do you have numbers from running that alone? Or, for that matter, from any of the other combinations in #3 individually?
Thanks,
Grant
On Jan 31, 2008, at 4:09 AM, Doron Cohen wrote:
Hi Otis,
On Thu, Jan 31, 2008 at 7:21 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
Doron - this looks super useful!
Can you give an example of the lexical affinities you mention here?
("Juru creates posting lists for lexical affinities")
Sure - simply put, denote {X} as the posting list of term X. Then for a query - A B C D - in addition to the four posting lists {A}, {B}, {C}, {D}, which are processed ignoring position info (i.e. Lucene's termDocs()), Juru also computes the combined posting lists {A,B}, {A,C}, {A,D}, {B,C}, {B,D} and {C,D}, in which a (virtual) term {X,Y} is said to exist in a document D if the two words X and Y are found in that document within a sliding window of size L (say 5).
(You can also require LAs to be in order, which is useful in some scenarios.)
Juru's tokenization detects sentences, so the two words must also be in the same sentence. The term-freq of that LA-term in the doc is, as usual, the number of matches in that doc satisfying this sliding-window rule.
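To make the sliding-window rule concrete, here is a minimal Java sketch of how that term-freq could be computed for one document from the two terms' position lists. This is just an illustration, not Juru's actual code; the exact counting rule (strict distance, counting every qualifying pair) is a guess on my part:

    import java.util.List;

    public class LaTermFreq {
        // Term frequency of the virtual LA-term {X,Y} in one document,
        // given the position lists of X and Y in that document.
        static int laTermFreq(List<Integer> posX, List<Integer> posY,
                              int window, boolean ordered) {
            int freq = 0;
            for (int px : posX) {
                for (int py : posY) {
                    // a pair of occurrences matches if it falls within the
                    // window; "ordered" additionally requires X before Y
                    int dist = ordered ? (py - px) : Math.abs(py - px);
                    if (dist > 0 && dist < window) {
                        freq++;
                    }
                }
            }
            return freq;
        }
    }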
The IDF of this term is not known in advance, and so it is first estimated based on the DF of X and Y; this estimate is later tuned as more documents are processed and more statistics are available.
You can see the resemblance to SpanNear queries. Note that the IDF of this virtual term is going to be high, and as such it "focuses" the search on the more relevant documents.
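In Lucene terms, the LA {X,Y} maps roughly to a span query like the following (field and term names here are just placeholders):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    static SpanQuery laQuery(String field, String wordX, String wordY) {
        SpanQuery x = new SpanTermQuery(new Term(field, wordX));
        SpanQuery y = new SpanTermQuery(new Term(field, wordY));
        // slop plays the role of the window size; inOrder=false matches
        // the unordered LA default
        return new SpanNearQuery(new SpanQuery[] { x, y }, 7, false);
    }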
In my Lucene implementation of this I used a window size of 7. Note that (1) there was no sentence-boundary knowledge in my Lucene implementation, and (2) the IDF was fixed all along, estimated from the involved terms' IDFs as computed once in the SpanNear query. The default computation is their sum. This is in most cases too low an IDF, I think. Phrase query, btw, behaves the same.
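To spell out the fixed estimate I mean, using Lucene's classic idf formula (the method name here is mine):

    // The fixed, summed estimate for the virtual term {X,Y}, using
    // Lucene's classic idf(t) = log(numDocs / (df(t) + 1)) + 1.
    static float summedIdfEstimate(int dfX, int dfY, int numDocs) {
        float idfX = (float) (Math.log(numDocs / (double) (dfX + 1)) + 1.0);
        float idfY = (float) (Math.log(numDocs / (double) (dfY + 1)) + 1.0);
        return idfX + idfY; // the SpanNear default described above
    }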
So in both cases (Phrase, Span) I think it would be interesting to experiment with an adaptive IDF computation that updates the IDF as more documents are processed. When the query is made of only a single span or only a single phrase element this is a waste of time. But when the query is more complex (as the query we built), with both multi-term parts and single-term parts, or several multi-term parts, then a more accurate IDF could improve the quality, I would think.
Implementation-wise, "Weight.value" would need to be updated, which might raise questions about the normalization of other query parts, but I am not sure about this now.
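A rough sketch of the adaptive idea - purely hypothetical, not an existing Lucene API: start from the summed estimate and refine it from the document frequency actually observed while scoring.

    // Hypothetical adaptive IDF for a virtual term: extrapolates the df
    // observed so far to the whole collection and recomputes the idf.
    class AdaptiveIdf {
        private final int numDocs;            // total docs in the index
        private final float initialEstimate;  // e.g. the summed estimate
        private int docsSeen;                 // docs processed so far
        private int docsMatched;              // docs where the term occurred

        AdaptiveIdf(int numDocs, float initialEstimate) {
            this.numDocs = numDocs;
            this.initialEstimate = initialEstimate;
        }

        void observe(boolean matched) {
            docsSeen++;
            if (matched) docsMatched++;
        }

        float idf() {
            if (docsMatched == 0) return initialEstimate;
            double projectedDf = docsMatched * (numDocs / (double) docsSeen);
            return (float) (Math.log(numDocs / (projectedDf + 1)) + 1.0);
        }
    }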
Well, I hope this makes sense - I will update the Wiki page with similar info...
Also:
"Normalized term-frequency, as in Juru.
Here, tf(freq) is normalized by the average term frequency of the
document."
I've never seen this mentioned anywhere except here and once on the ML (was it you who mentioned this?), but this sounds intuitive.
Yes, I think I mentioned this. I think it is not our idea - Juru uses it, but it was used before in the SMART system - see "Length Normalization in Degraded Text Collections (1995)", http://citeseer.ist.psu.edu/100699.html, and "New Retrieval Approaches Using SMART: TREC 4", http://citeseer.ist.psu.edu/144841.html.
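If I read those papers right, the normalization has the form below (a sketch; the names are mine, and whether Juru uses exactly this variant is an assumption):

    // SMART-style normalized tf: the raw freq is log-damped and divided
    // by the log-damped average term frequency of the document.
    static float normalizedTf(float freq, float avgTermFreqInDoc) {
        return (float) ((1.0 + Math.log(freq)) /
                        (1.0 + Math.log(avgTermFreqInDoc)));
    }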
What do others think?
Otis
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ