On Thu, Jan 31, 2008 at 11:09 AM, Doron Cohen <[EMAIL PROTECTED]> wrote:
> Hi Otis,
>
> On Thu, Jan 31, 2008 at 7:21 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>
> > Doron - this looks super useful!
> > Can you give an example for the lexical affinities you mention here?
> > ("Juru creates posting lists for lexical affinities")
>
> Sure - simply put, denote {X} as the posting list of term X. Then for a query - A B C D - in addition to the four posting lists {A}, {B}, {C}, {D}, which are processed ignoring position info (i.e. Lucene's termDocs()), Juru also computes combined posting lists {A,B}, {A,C}, {A,D}, {B,C}, {B,D} and {C,D}, in which a (virtual) term {X,Y} is said to exist in a document D if the two words X and Y are found in that document within a sliding window of size L (say 5).

The wiki page now has a more complete example. (You can also require LAs to be in order, which is useful in some scenarios.)

> Juru's tokenization detects sentences, and so the two words must be in the same sentence. The term-freq of that LA-term in the doc is, as usual, the number of matches in that doc satisfying this sliding-window rule.
>
> The IDF of this term is not known in advance, and so it is first estimated based on the DF of X and Y, and this estimate is later tuned as more documents are processed and more statistics are available.

This was not so accurate a description. What Juru really does is compute in advance the first (e.g.) 1MB of the LA posting and use its computed IDF for the entire posting. Experiments with more accurate adaptive computation (for longer LA postings) showed no advantage over this simpler approach.

> You can see the resemblance to SpanNear queries. Note that the IDF of this virtual term is going to be high, and as such it "focuses" the query search on the more relevant documents.
>
> In my Lucene implementation of this I used a window size of 7, and note that (1) there was no sentence-boundary knowledge in my Lucene implementation, and (2) the IDF was fixed all along, estimated from the involved terms' IDFs as computed once in the SpanNear query. The default computation is their sum. This is in most cases too low an IDF, I think. Phrase query, btw, behaves the same.
>
> So in both cases (Phrase, Span) I think it would be interesting to experiment with adaptive IDF computation that updates the IDF as more documents are processed. When the query is made of only a single span or only a single phrase element this is a waste of time. But when the query is more complex (like the query we built) and has both multi-term parts and single-term parts, or several multi-term parts, then a more accurate IDF could improve the quality, I would think. Implementation-wise, "Weight.value" would need to be updated, which might raise questions about the normalization of other query parts, but I am not sure about this now.

Well, I discussed this with my colleague David Carmel, who pointed out that summing the IDFs actually makes sense: each IDF is *nearly* a log of nDocs/DF, and so summing these near-logs is (nearly) the log of their product (of the (1+nDocs/DF) values). So I no longer see here a problem to fix, or an immediate opportunity to explore...

> Well I hope this makes sense - I will update the Wiki page with similar info...
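To make the sliding-window rule above concrete, here is a rough Java sketch of counting the term-freq of a virtual LA term {X,Y} in a tokenized document. Illustration only - this is not Juru's code, the names are made up, and like my Lucene implementation it ignores sentence boundaries (Juru's exact counting rule may also differ):

    import java.util.Arrays;
    import java.util.List;

    public class LexicalAffinityCounter {

      /**
       * Counts matches of the virtual LA term {x,y} in a tokenized doc:
       * the number of positions where x and y co-occur, in either order,
       * within a sliding window of the given size.
       */
      public static int countLaMatches(List<String> tokens, String x, String y, int window) {
        int count = 0;
        for (int i = 0; i < tokens.size(); i++) {
          String t = tokens.get(i);
          if (!t.equals(x) && !t.equals(y)) {
            continue;
          }
          String other = t.equals(x) ? y : x;
          // scan ahead for the partner word, staying inside the window
          int end = Math.min(tokens.size(), i + window);
          for (int j = i + 1; j < end; j++) {
            if (tokens.get(j).equals(other)) {
              count++; // each anchor position contributes at most one match
              break;
            }
          }
        }
        return count;
      }

      public static void main(String[] args) {
        List<String> doc = Arrays.asList("fast", "search", "over", "big", "text", "search", "index");
        System.out.println(countLaMatches(doc, "search", "index", 5)); // prints 1
      }
    }

In Lucene this is roughly the set of matches a SpanNearQuery over the two terms would find, with a slop on the order of L and inOrder=false (requiring LAs in order corresponds to inOrder=true).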
> Also:
>
> > "Normalized term-frequency, as in Juru. Here, tf(freq) is normalized by the average term frequency of the document."
> >
> > I've never seen this mentioned anywhere except here and once here on the ML (was it you who mentioned this?), but this sounds intuitive.
>
> Yes, I think I mentioned this - but I think it is not our idea - Juru uses it, but it was used before in the SMART system - see "Length Normalization in Degraded Text Collections (1995)" - http://citeseer.ist.psu.edu/100699.html - and "New Retrieval Approaches Using SMART: TREC 4" - http://citeseer.ist.psu.edu/144841.html.
>
> > What do others think?
> >
> > Otis
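And for the archives, here is a small sketch of what "normalized by the average term frequency of the document" can look like. I am not claiming this is Juru's exact formula - the variant below is the SMART-style (1 + log tf) / (1 + log avgTf) from the pivoted-normalization line of work cited above:

    import java.util.HashMap;
    import java.util.Map;

    public class AvgTfNormalization {

      /**
       * SMART-style normalized tf: (1 + log(freq)) / (1 + log(avgFreq)),
       * where avgFreq is the average raw frequency over the distinct
       * terms of the document. Assumes freq >= 1 and a non-empty map.
       */
      public static float normalizedTf(int freq, Map<String, Integer> docTermFreqs) {
        int total = 0;
        for (int f : docTermFreqs.values()) {
          total += f;
        }
        double avgFreq = (double) total / docTermFreqs.size();
        return (float) ((1 + Math.log(freq)) / (1 + Math.log(avgFreq)));
      }

      public static void main(String[] args) {
        Map<String, Integer> freqs = new HashMap<String, Integer>();
        freqs.put("lucene", 4);
        freqs.put("search", 2);
        freqs.put("doron", 1);
        // avgFreq = 7/3 ~ 2.33; normalized tf of 4 ~ 1.29
        System.out.println(normalizedTf(4, freqs));
      }
    }

Note that plugging something like this into Lucene is the awkward part: Similarity.tf(freq) sees only the raw frequency, not per-document statistics like avgFreq, so the average would have to be computed at index time and folded in some other way (e.g. into the field norm).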