I think the IDF is also computed per term (token plus field name), not per bare token. Maybe an expert can confirm this; otherwise we will have to devise a test.
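One rough way to test this without building an index is sketched below. It is a plain-Python model, not Lucene code. It assumes the classic formula idf = 1 + ln(numDocs / (docFreq + 1)) from Lucene's DefaultSimilarity, that docFreq is counted per (field, token) term, and that numDocs is the document count of the whole index. The toy corpus and the helper names classic_idf and count_doc_freq are made up for illustration.

```python
import math


def classic_idf(doc_freq, num_docs):
    # Assumed formula (Lucene DefaultSimilarity): 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def count_doc_freq(docs, field, token):
    # A term is a (field, token) pair, so only the given field is inspected.
    return sum(1 for doc in docs if token in doc.get(field, set()))


# Hypothetical toy corpus: each document carries one language-specific field.
german = [
    {"text-stemmed-de": {"firewall", "netzwerk"}},
    {"text-stemmed-de": {"netzwerk"}},
    {"text-stemmed-de": {"server"}},
    {"text-stemmed-de": {"firewall"}},
]
english = [{"text-stemmed-en": {"firewall"}}] + [
    {"text-stemmed-en": {"network"}} for _ in range(7)
]
combined = german + english  # single index holding both languages

# Document frequency of the German term, with and without the English docs.
df_de_combined = count_doc_freq(combined, "text-stemmed-de", "firewall")
df_de_separate = count_doc_freq(german, "text-stemmed-de", "firewall")
df_en_combined = count_doc_freq(combined, "text-stemmed-en", "firewall")

# idf of the German term in one shared index versus a German-only index.
idf_de_combined = classic_idf(df_de_combined, len(combined))
idf_de_separate = classic_idf(df_de_separate, len(german))

print("docFreq de (combined, separate):", df_de_combined, df_de_separate)
print("docFreq en (combined):", df_en_combined)
print("idf de (combined, separate):", idf_de_combined, idf_de_separate)
```

Under these assumptions the document frequency of text-stemmed-de:firewall is not affected by the English documents, but its idf still shifts by ln(12/4) because numDocs grows. In this model the shift is the same for every term in the field, so whether it changes rankings noticeably would still need checking against a real index.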
paul

On 3 Jan 2012, at 15:43, heikki wrote:

> hi Paul,
>
> yes, but my concern isn't about the term frequency, but rather the
> inverse document frequency, which is also used in the relevance score and
> which takes into account all documents in the index.. in this way the
> relevance score of one document is influenced by the contents of all other
> documents that are in the same index. This is why it seems logical to me
> that if different domains use separate indexes, the relevance scoring is
> more accurate.
>
> Kind regards,
> Heikki Doeleman
>
> On Tue, Jan 3, 2012 at 3:29 PM, Paul Libbrecht <p...@hoplahup.net> wrote:
>
>> Heikki,
>>
>> it does solve your main concern: a term in lucene is a pair of a token and
>> field name.
>> The term frequency is, thus, the frequency of a token in a field.
>>
>> So the term-frequency of text-stemmed-de:firewall is independent of the
>> term-frequency of text-stemmed-en:firewall (for example).
>>
>> But using the query expansion mechanism, it is likely that both
>> term-queries will be present and both contribute to the score. Which is
>> correct I think.
>>
>> paul
>>
>> On 3 Jan 2012, at 15:06, heikki wrote:
>>>
>>>> The important bit is to use query-expansion.
>>>> Given a query of the user (with params or not, with text-queries), expand
>>>> it to a query where the "normal text" is expected to be in the right
>>>> language, but maybe also in one of the other languages (that
>>>> the browser says, that your platform supports), with less weight of course.
>>>
>>> something like that we do now in a single index solution - results in the
>>> requested language are boosted enough so they're always on top
>>>
>>> I don't think though that this addresses what is my main point: the
>>> frequency of terms in different domains (in this case, different languages)
>>> is different for each domain. This means that if the domains are chunked
>>> together in one index, the IDF value for a term is less "accurate" than if
>>> multiple, separate indexes were used. A term is more or less frequent in
>>> one domain or another, for a reason.. Relevance ranking is impacted by
>>> that, and is more accurate if separate indexes are used -- I think this
>>> seems logical.
>>>
>>> I just don't know how much impact it really has, and whether it is worth
>>> dealing with it by presenting separate result sets from separate index
>>> searches ..