There are two ways you can work with the taxonomy index in a distributed environment (at least, these are the things that we've tried): (1) replicate the taxonomy to all shards, so that each shard sees the entire global taxonomy (2) each shard maintains its own taxonomy.
(1) only makes sense when the shards are built by a side process,e.g. MapReduce, and then copied to their respective nodes. If you index like that, then your distributed faceted search (correcting the counts of categories) is done on ordinals rather than strings. (2) is the one that makes sense to most people, and is also NRT (where #1 isn't!). Each shard maintains its local search + taxonomy indexes. In that mode, the counts correction cannot be done on ordinals, and has to be done on strings. When you're doing distributed faceted search, you cannot just ask each shard to return the top-10 categories for the "Author" dimension/facet, because then you might (1) miss a category that should have been in the total top-10 and (2) return a category in the top-10 with incorrect counts. What you can do is ask for C*10, where C is an over-counting factor. You'd still hope for the best though, b/c theoretically, you could still miss a category / have incorrect count for one. The difference between the two approaches is how big C can be. In the first approach, since all you transmit on the wire are integers, and the merge is done on integers, you can set C much higher than in the second approach. In practice though, since more and more applications are interested in real-time search, we keep a local taxonomy index per each shard and do the merge on the strings. Also, when you're doing really large-scale, exact counts for categories may not be so usable. How is Science Fiction (123,367,129) different than Drama (145,465,987) !? To the user these are just categories that are associated with too many documents than I can digest anyway :). For that, we do sampling and display %, which is more consumable by users, and then you don't need to worry about exact counts. I think that I wrote a bit too long an answer to your question :). Regarding not deleting categories, we've thought about it in the past and I'm not sure it's a problem. I mean, in theory, yes, you could end up w/ a taxonomy index that has many unused categories. But: * Whenever we were dealing with timestamp-based applications, at large scale, they always created shards per time (e.g. per day / hour) and when the taxonomy index is local to the shard, then it's gone completely when the shard is gone. * You can always re-map the ordinals to new ones by running a side process which checks which of the categories are unused, adds those that are in use to a new taxonomy index and rewrites the payload posing of the search index. It sounds expensive, but we've never had to do it yet, so I don't know how much expensive. At the end of the day, the facets module lets you build the faceted search that best suits your needs. It can work entirely off-disk, it can be loaded in-memory (similar to FieldCache, Mike and I are working on some improvements there - you're welcome to join!), it can support exact counts or sampling, other aggregation methods than just counting and many more. The sidecar taxonomy index is not as bad as it sounds. As I've told you, many IBM products are working with it for many years, at small and large scale. I think that Solr could benefit from this module too, and I hope that I don't sound too biased :). Having Solr reusing Lucene modules is important IMO. Shai On Wed, Dec 12, 2012 at 1:12 AM, Lukáš Vlček <[email protected]> wrote: > Hi Shai, > > thanks for your blog, I am looking forward to your future posts! > > Just two questions: you mentioned that you have been running this in > production in distributed mode. If I understand it correctly the idea is > there is only a single taxonomy index even if the distributed mode means > that the data indices were partitioned/sharded. (Thus the ordinals are > global). The taxonomy index is not partitioned/sharded itself. Am I correct? > > Also what seems to be an interesting implication of this implementation is > the fact that taxonomy index never cares about deleted documents > (categories that are obsolete). In practices this is probably not a bit > deal because the taxonomy index is small but I can imagine this might be > problematic in some situations (for example imagine that the categories > would be based on highly granular timestamp, that could create a lot of > categories over short period of time and those would be kept "forever" and > still growing...). > (^^ I am just trying to understand how it works.) > > Regards, > Lukas >
