NUTCH-92

Andrzej Bialecki Tue, 25 Nov 2008 17:05:02 -0800

Hi all,

After reading this paper:


http://wortschatz.uni-leipzig.de/~fwitschel/papers/ipm1152.pdf

I came up with the following idea of implementing global IDF in Nutch. The upside of the approach I propose is that it brings back the cost of making a search query to 1 RPC call. The downside is that the search servers need to cache global IDF estimates as computed by the DS.Client, which ties them to a single query front-end (DistributedSearch.Client), or requires keeping a map of <client, globalIDFs> on each search server.


---------

First, as the paper above claims, we don't really need exact IDF values of all terms from every index. We should get acceptable quality if we only learn the top-N frequent terms, and for the rest of them we apply a smoothing function that is based on global characteristics of each index (such as the number of terms in the index).

This means that the data that needs to be collected by the query integrator (DS.Client in Nutch) from shard servers (DS.Server in Nutch) would consist of a list of e.g. top 500 local terms with their frequency, plus the local smoothing factor as a single value.
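
To make the shape of that per-shard payload concrete, here is a minimal sketch in Java. LocalTermStats and fromDocFreqs are hypothetical names, not existing Nutch classes, and the smoothing factor is simply passed in because its formula is not fixed above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical payload one shard would send to the query integrator. */
public class LocalTermStats {
    /** term -> local frequency, most frequent first, top N only. */
    public final Map<String, Integer> topTerms;
    /** Single value summarizing all terms that did not make the top N. */
    public final double smoothingFactor;

    public LocalTermStats(Map<String, Integer> topTerms, double smoothingFactor) {
        this.topTerms = topTerms;
        this.smoothingFactor = smoothingFactor;
    }

    /** Keep only the n most frequent terms from a full term -> frequency map. */
    public static LocalTermStats fromDocFreqs(Map<String, Integer> docFreqs,
                                              int n, double smoothingFactor) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(docFreqs.entrySet());
        // Sort by frequency, highest first.
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        Map<String, Integer> top = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(n, entries.size()))) {
            top.put(e.getKey(), e.getValue());
        }
        return new LocalTermStats(top, smoothingFactor);
    }
}
```

With n = 500 this would be a few kilobytes per shard before any further encoding.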

We could further reduce the amount of data to be sent from/to shard servers by encoding this information in a counted Bloom filter with a single-byte resolution (or a spectral Bloom filter, whichever yields a better precision / bit in our case).
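
As a rough illustration of the single-byte idea, the sketch below keeps one byte per counter and reads a term's value as the minimum over its counters. The class name, the CRC32-based hashing, and the saturation at 255 are all assumptions made for the example; real frequencies would need to be quantized (for instance on a log scale) to fit in a byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Illustrative counting Bloom filter with one unsigned byte per counter. */
public class CountingBloomFilter {
    private final byte[] counters;
    private final int numHashes;

    public CountingBloomFilter(int size, int numHashes) {
        this.counters = new byte[size];
        this.numHashes = numHashes;
    }

    /** i-th hash of the term; salting CRC32 is a simplification, not a tuned choice. */
    private int index(String term, int i) {
        CRC32 crc = new CRC32();
        crc.update((i + ":" + term).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % counters.length);
    }

    /** Add a (quantized) count for the term, saturating each counter at 255. */
    public void add(String term, int count) {
        for (int i = 0; i < numHashes; i++) {
            int idx = index(term, i);
            int current = counters[idx] & 0xFF; // read the byte as unsigned
            counters[idx] = (byte) Math.min(255, current + count);
        }
    }

    /** Estimate = minimum counter; never below the true value unless saturated. */
    public int estimate(String term) {
        int min = 255;
        for (int i = 0; i < numHashes; i++) {
            min = Math.min(min, counters[index(term, i)] & 0xFF);
        }
        return min;
    }
}
```

Collisions can only push an estimate up, so the error is one-sided; how much precision per bit this buys compared to a spectral Bloom filter would have to be measured.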

The query integrator would ask all active shard servers to provide their local IDF data, and it would compute global IDFs for these terms, plus a global smoothing factor, and send back the updated information to each shard server. This would happen once per lifetime of a local shard, and is needed because of the local query rewriting (and expansion of terms from Nutch Query to Lucene Query).
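
A minimal sketch of the merge step on the integrator side, in Java. It assumes each shard also reports its document count (not listed above, but needed to turn frequencies into IDFs), and it uses a formula of the same shape as Lucene's classic idf, log(numDocs / (docFreq + 1)) + 1. GlobalIdfMerger is a hypothetical name, and the global smoothing factor is left out because its formula is not specified here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative merge of per-shard document frequencies into global IDFs. */
public class GlobalIdfMerger {

    /** Assumed IDF shape: log(numDocs / (docFreq + 1)) + 1. */
    public static double idf(long docFreq, long numDocs) {
        return Math.log((double) numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /**
     * @param shardDocFreqs one term -> document frequency map per shard (top N only)
     * @param shardNumDocs  document count of each shard, in the same order
     */
    public static Map<String, Double> merge(List<Map<String, Integer>> shardDocFreqs,
                                            List<Integer> shardNumDocs) {
        long totalDocs = 0;
        for (int n : shardNumDocs) {
            totalDocs += n;
        }
        // Sum document frequencies across shards.
        Map<String, Long> globalDf = new HashMap<>();
        for (Map<String, Integer> df : shardDocFreqs) {
            for (Map.Entry<String, Integer> e : df.entrySet()) {
                globalDf.merge(e.getKey(), e.getValue().longValue(), Long::sum);
            }
        }
        Map<String, Double> globalIdf = new HashMap<>();
        for (Map.Entry<String, Long> e : globalDf.entrySet()) {
            globalIdf.put(e.getKey(), idf(e.getValue(), totalDocs));
        }
        return globalIdf;
    }
}
```

Note that a term present in a shard but outside that shard's top N is undercounted by this sum, which is the gap the smoothing factors are meant to cover.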

Shard servers would then process incoming queries using the IDF estimates for terms included in the global IDF data, or the global smoothing factors for terms missing from that data (or use local IDFs).
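
On the shard side the lookup could be as simple as the following sketch. GlobalIdfCache is a hypothetical name, and smoothedIdf stands in for whatever value the global smoothing function produces for unlisted terms.

```java
import java.util.Map;

/** Illustrative shard-side view of the global IDF data received from one client. */
public class GlobalIdfCache {
    private final Map<String, Double> globalIdf; // terms covered by the global data
    private final double smoothedIdf;            // fallback for every other term

    public GlobalIdfCache(Map<String, Double> globalIdf, double smoothedIdf) {
        this.globalIdf = globalIdf;
        this.smoothedIdf = smoothedIdf;
    }

    /** Global estimate when the term is listed, smoothed value otherwise. */
    public double idf(String term) {
        return globalIdf.getOrDefault(term, smoothedIdf);
    }
}
```

Falling back to the local IDF instead would only change the default branch.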

The global IDF data would have to be recomputed each time the set of shards available to a DS.Client changes, and then it needs to be broadcast back from the client to all servers - which is the downside of this solution, because servers need to keep a cache of this information for every DS.Client (each of them possibly having a different list of shard servers, hence different IDFs). Also, as shard servers come and go, the IDF data keeps being recomputed and broadcast, which increases the traffic between the client and servers.
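
The per-client cache on each server could look roughly like this sketch. PerClientIdfStore and the string client identifier are assumptions for illustration; how a DS.Client would identify itself to a server is not settled here.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative server-side map of <client, globalIDFs>. */
public class PerClientIdfStore {
    private final Map<String, Map<String, Double>> byClient = new ConcurrentHashMap<>();

    /** Called on every re-broadcast; replaces the client's previous snapshot. */
    public void update(String clientId, Map<String, Double> globalIdf) {
        byClient.put(clientId, Collections.unmodifiableMap(new HashMap<>(globalIdf)));
    }

    /** Snapshot for this client, or an empty map if it has not broadcast yet. */
    public Map<String, Double> get(String clientId) {
        return byClient.getOrDefault(clientId, Collections.<String, Double>emptyMap());
    }

    /** Drop the data of a client that went away. */
    public void evict(String clientId) {
        byClient.remove(clientId);
    }
}
```

Each update swaps in a whole new snapshot, so queries in flight keep reading a consistent set of IDFs.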

Still, I believe the amount of additional traffic should be minimal in a typical scenario, where changes to the shards are much less frequent than user queries. :)


------

Now, if this approach seems viable (please comment on this), what should we do with the patches in NUTCH-92?

1. skip them for now, and wait until the above approach is implemented, and pay the penalty of using skewed local IDFs.

2. apply them now, and pay the penalty of an additional RPC call / search, and replace this mechanism with the one described above, whenever that becomes available.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
