Hi all,

After reading this paper:

http://wortschatz.uni-leipzig.de/~fwitschel/papers/ipm1152.pdf

I came up with the following idea of implementing global IDF in Nutch. The upside of the approach I propose is that it brings back the cost of making a search query to 1 RPC call. The downside is that the search servers need to cache global IDF estimates as computed by the DS.Client, which ties them to a single query front-end (DistributedSearch.Client), or requires keeping a map of <client, globalIDFs> on each search server.

---------

First, as the paper above claims, we don't really need exact IDF values of all terms from every index. We should get acceptable quality if we only learn the top-N frequent terms, and for the rest of them we apply a smoothing function that is based on global characteristics of each index (such as the number of terms in the index).

This means that the data that needs to be collected by the query integrator (DS.Client in Nutch) from shard servers (DS.Server in Nutch) would consist of a list of e.g. top 500 local terms with their frequency, plus the local smoothing factor as a single value.

We could further reduce the amount of data to be sent from/to shard servers by encoding this information in a counted Bloom filter with a single-byte resolution (or a spectral Bloom filter, whichever yields a better precision / bit in our case).

The query integrator would ask all active shard servers to provide their local IDF data, and it would compute global IDFs for these terms, plus a global smoothing factor, and send back the updated information to each shard server. This would happen once per lifetime of a local shard, and is needed because of the local query rewriting (and expansion of terms from Nutch Query to Lucene Query).

Shard servers would then process incoming queries using the IDF estimates for terms included in the global IDF data, or the global smoothing factors for terms missing from that data (or use local IDFs).

The global IDF data would have to be recomputed each time the set of shards available to a DS.Client changes, and then it needs to be broadcast back from the client to all servers - which is the downside of this solution, because servers need to keep a cache of this information for every DS.Client (each of them possibly having a different list of shard servers, hence different IDFs). Also, as shard servers come and go, the IDF data keeps being recomputed and broadcast, which increases the traffic between the client and servers.

Still I believe the amount of additional traffic should be minimal in a typical scenario, where changes to the shards are much less frequent than the frequency of sending user queries. :)

------

Now, if this approach seems viable (please comment on this), what should we do with the patches in NUTCH-92 ?

1. skip them for now, and wait until the above approach is implemented, and pay the penalty of using skewed local IDFs.

2. apply them now, and pay the penalty of additional RPC call / search, and replace this mechanism with the one described above, whenever that becomes available.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Reply via email to