On Sat, 2011-07-09 at 05:44 +0200, Shai Erera wrote:
> The taxonomy is global to the index, but I think it will be
> interesting to explore per-segment taxonomy, and how it can be used to
> improve indexing or search perf (hopefully both).

I have struggled with this for some time and still haven't found a real
solution. Distributed faceting, with segment-based faceting as a special
case, is hard to do without a central taxonomy. The new faceting module
is explicit about the central taxonomy. My experiments with
https://issues.apache.org/jira/browse/LUCENE-2369 compute it at index
open time. Neither approach works very well, if at all, in a real
distributed environment.

The problem is the same for flat faceting but is magnified with
hierarchical faceting: when the sort order of facet elements is
popularity based, computing the correct counts for a top-X might
involve comparing the whole result from each part. A pathological case
for flat faceting is

  Part 1: A1(2), A2(2) ... An(2)
  Part 2: B1(3), B2(2), B3(2) ... Bn(2), An(1)

where the correct top 3 answer is An(3), B1(3), A2(2). Getting there
requires the full result from each part, because An(2) and An(1) are
the last elements of their respective lists.

For real-world use, we can do clever counting so that we only return
what is necessary, but that does not change the worst case. To ensure
that we don't hit any million-entry merge situations, we must cheat and
impose a cutoff point.

With a multi-level faceting result (state/town/street expanded to the
top 5 elements on all levels) we must resolve quite a lot of elements to
have a high chance of getting the right elements with the right counts.
We can avoid this by drilling down one level at a time, but that just
replaces bulk transfers with multiple requests: 1*5*5 is the
unrealistically low minimum for the address case.

- Toke
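
For anyone who wants to poke at the pathological flat-faceting case, here is a minimal Python sketch of it. It is only an illustration under stated assumptions, not how the faceting module or LUCENE-2369 merges results: the helper names `build_parts` and `merge_top` are made up for this example, each part is assumed to return its entries in local popularity order, and ties are broken alphabetically, so the third slot comes out as A1(2) rather than A2(2).

```python
from collections import Counter


def build_parts(n):
    """Build the two parts from the example, each in local popularity order."""
    # Part 1: A1(2), A2(2) ... An(2)
    part1 = [(f"A{i}", 2) for i in range(1, n + 1)]
    # Part 2: B1(3), B2(2) ... Bn(2), An(1)
    part2 = [("B1", 3)] + [(f"B{i}", 2) for i in range(2, n + 1)] + [(f"A{n}", 1)]
    return part1, part2


def merge_top(parts, k, cutoff=None):
    """Sum counts over the first `cutoff` entries of each part, return the top k.

    cutoff=None means every entry from every part is merged (the exact answer).
    """
    totals = Counter()
    for part in parts:
        for term, count in part[:cutoff]:
            totals[term] += count
    # Highest count first; ties broken by term so the output is deterministic.
    return sorted(totals.items(), key=lambda tc: (-tc[1], tc[0]))[:k]


if __name__ == "__main__":
    parts = build_parts(10)
    print("full merge:", merge_top(parts, 3))
    print("cutoff=3:  ", merge_top(parts, 3, cutoff=3))
    print("cutoff=10: ", merge_top(parts, 3, cutoff=10))
```

With n=10, the full merge finds A10 with a count of 3. A cutoff of 3 per part misses A10 entirely, and even a cutoff of 10 reports it with a count of 2, because its last occurrence sits at position n+1 in part 2.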
