even as a step it would be nice to have lucene's faceting exposed to solr in a way that only works with a single node.
because it supports NRT, doesnt need to build up massive top-level datastructures and so on, many people that currently need multiple nodes might be able to work just fine with a single node. On Wed, Dec 12, 2012 at 2:28 AM, Shai Erera <ser...@gmail.com> wrote: > There are two ways you can work with the taxonomy index in a distributed > environment (at least, these are the things that we've tried): > (1) replicate the taxonomy to all shards, so that each shard sees the entire > global taxonomy > (2) each shard maintains its own taxonomy. > > (1) only makes sense when the shards are built by a side process,e.g. > MapReduce, and then copied to their respective nodes. > If you index like that, then your distributed faceted search (correcting the > counts of categories) is done on ordinals rather than > strings. > > (2) is the one that makes sense to most people, and is also NRT (where #1 > isn't!). Each shard maintains its local search + > taxonomy indexes. In that mode, the counts correction cannot be done on > ordinals, and has to be done on strings. > > When you're doing distributed faceted search, you cannot just ask each shard > to return the top-10 categories for the "Author" dimension/facet, > because then you might (1) miss a category that should have been in the > total top-10 and (2) return a category in the top-10 with > incorrect counts. What you can do is ask for C*10, where C is an > over-counting factor. You'd still hope for the best though, b/c > theoretically, you could still miss a category / have incorrect count for > one. > > The difference between the two approaches is how big C can be. In the first > approach, since all you transmit on the wire are > integers, and the merge is done on integers, you can set C much higher than > in the second approach. In practice though, since > more and more applications are interested in real-time search, we keep a > local taxonomy index per each shard and do the merge > on the strings. > > Also, when you're doing really large-scale, exact counts for categories may > not be so usable. How is Science Fiction (123,367,129) > different than Drama (145,465,987) !? To the user these are just categories > that are associated with too many documents than I > can digest anyway :). > > For that, we do sampling and display %, which is more consumable by users, > and then you don't need to worry about exact counts. > > I think that I wrote a bit too long an answer to your question :). > > Regarding not deleting categories, we've thought about it in the past and > I'm not sure it's a problem. I mean, in theory, yes, you could > end up w/ a taxonomy index that has many unused categories. But: > > * Whenever we were dealing with timestamp-based applications, at large > scale, they always created shards per time (e.g. per day / hour) > and when the taxonomy index is local to the shard, then it's gone > completely when the shard is gone. > > * You can always re-map the ordinals to new ones by running a side process > which checks which of the categories are unused, adds > those that are in use to a new taxonomy index and rewrites the payload > posing of the search index. It sounds expensive, but we've > never had to do it yet, so I don't know how much expensive. > > At the end of the day, the facets module lets you build the faceted search > that best suits your needs. It can work entirely off-disk, > it can be loaded in-memory (similar to FieldCache, Mike and I are working on > some improvements there - you're welcome to join!), it > can support exact counts or sampling, other aggregation methods than just > counting and many more. > > The sidecar taxonomy index is not as bad as it sounds. As I've told you, > many IBM products are working with it for many years, at small > and large scale. > > I think that Solr could benefit from this module too, and I hope that I > don't sound too biased :). > Having Solr reusing Lucene modules is important IMO. > > Shai > > > On Wed, Dec 12, 2012 at 1:12 AM, Lukáš Vlček <lukas.vl...@gmail.com> wrote: >> >> Hi Shai, >> >> thanks for your blog, I am looking forward to your future posts! >> >> Just two questions: you mentioned that you have been running this in >> production in distributed mode. If I understand it correctly the idea is >> there is only a single taxonomy index even if the distributed mode means >> that the data indices were partitioned/sharded. (Thus the ordinals are >> global). The taxonomy index is not partitioned/sharded itself. Am I correct? >> >> Also what seems to be an interesting implication of this implementation is >> the fact that taxonomy index never cares about deleted documents (categories >> that are obsolete). In practices this is probably not a bit deal because the >> taxonomy index is small but I can imagine this might be problematic in some >> situations (for example imagine that the categories would be based on highly >> granular timestamp, that could create a lot of categories over short period >> of time and those would be kept "forever" and still growing...). >> (^^ I am just trying to understand how it works.) >> >> Regards, >> Lukas > > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org