Hi, On Wed, Nov 26, 2008 at 3:04 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: > Hi all, > > After reading this paper: > > http://wortschatz.uni-leipzig.de/~fwitschel/papers/ipm1152.pdf > > I came up with the following idea of implementing global IDF in Nutch. The > upside of the approach I propose is that it brings back the cost of making a > search query to 1 RPC call. The downside is that the search servers need to > cache global IDF estimates as computed by the DS.Client, which ties them to > a single query front-end (DistributedSearch.Client), or requires keeping a > map of <client, globalIDFs> on each search server. > > --------- > > First, as the paper above claims, we don't really need exact IDF values of > all terms from every index. We should get acceptable quality if we only > learn the top-N frequent terms, and for the rest of them we apply a > smoothing function that is based on global characteristics of each index > (such as the number of terms in the index). > > This means that the data that needs to be collected by the query integrator > (DS.Client in Nutch) from shard servers (DS.Server in Nutch) would consist > of a list of e.g. top 500 local terms with their frequency, plus the local > smoothing factor as a single value. > > We could further reduce the amount of data to be sent from/to shard servers > by encoding this information in a counted Bloom filter with a single-byte > resolution (or a spectral Bloom filter, whichever yields a better precision > / bit in our case). > > The query integrator would ask all active shard servers to provide their > local IDF data, and it would compute global IDFs for these terms, plus a > global smoothing factor, and send back the updated information to each shard > server. This would happen once per lifetime of a local shard, and is needed > because of the local query rewriting (and expansion of terms from Nutch > Query to Lucene Query). > > Shard servers would then process incoming queries using the IDF estimates > for terms included in the global IDF data, or the global smoothing factors > for terms missing from that data (or use local IDFs). > > The global IDF data would have to be recomputed each time the set of shards > available to a DS.Client changes, and then it needs to be broadcast back > from the client to all servers - which is the downside of this solution, > because servers need to keep a cache of this information for every DS.Client > (each of them possibly having a different list of shard servers, hence > different IDFs). Also, as shard servers come and go, the IDF data keeps > being recomputed and broadcast, which increases the traffic between the > client and servers. > > Still I believe the amount of additional traffic should be minimal in a > typical scenario, where changes to the shards are much less frequent than > the frequency of sending user queries. :) > > ------ > > Now, if this approach seems viable (please comment on this), what should we > do with the patches in NUTCH-92 ? > > 1. skip them for now, and wait until the above approach is implemented, and > pay the penalty of using skewed local IDFs. > > 2. apply them now, and pay the penalty of additional RPC call / search, and > replace this mechanism with the one described above, whenever that becomes > available. >
It seems I wrote the patch in NUTCH-92. My recollection was that you wrote it, Andrzej :D Anyway, I have no idea what I did in that patch, don't know if it works or applies etc. Really, I am just curios. Did anyone test it? Does it really work :) ? I haven't read the paper yet but the proposed approach sounds better to me. Do you have any code ready, Andrzej? Or how difficult is it to implement it? > -- > Best regards, > Andrzej Bialecki <>< > ___. ___ ___ ___ _ _ __________________________________ > [__ || __|__/|__||\/| Information Retrieval, Semantic Web > ___|||__|| \| || | Embedded Unix, System Integration > http://www.sigram.com Contact: info at sigram dot com > > -- Doğacan Güney