There are two ways you can work with the taxonomy index in a distributed
environment (at least, these are the things that we've tried):
(1) replicate the taxonomy to all shards, so that each shard sees the
entire global taxonomy
(2) each shard maintains its own taxonomy.

(1) only makes sense when the shards are built by a side process,e.g.
MapReduce, and then copied to their respective nodes.
If you index like that, then your distributed faceted search (correcting
the counts of categories) is done on ordinals rather than
strings.

(2) is the one that makes sense to most people, and is also NRT (where #1
isn't!). Each shard maintains its local search +
taxonomy indexes. In that mode, the counts correction cannot be done on
ordinals, and has to be done on strings.

When you're doing distributed faceted search, you cannot just ask each
shard to return the top-10 categories for the "Author" dimension/facet,
because then you might (1) miss a category that should have been in the
total top-10 and (2) return a category in the top-10 with
incorrect counts. What you can do is ask for C*10, where C is an
over-counting factor. You'd still hope for the best though, b/c
theoretically, you could still miss a category / have incorrect count for
one.

The difference between the two approaches is how big C can be. In the first
approach, since all you transmit on the wire are
integers, and the merge is done on integers, you can set C much higher than
in the second approach. In practice though, since
more and more applications are interested in real-time search, we keep a
local taxonomy index per each shard and do the merge
on the strings.

Also, when you're doing really large-scale, exact counts for categories may
not be so usable. How is Science Fiction (123,367,129)
different than Drama (145,465,987) !? To the user these are just categories
that are associated with too many documents than I
can digest anyway :).

For that, we do sampling and display %, which is more consumable by users,
and then you don't need to worry about exact counts.

I think that I wrote a bit too long an answer to your question :).

Regarding not deleting categories, we've thought about it in the past and
I'm not sure it's a problem. I mean, in theory, yes, you could
end up w/ a taxonomy index that has many unused categories. But:

* Whenever we were dealing with timestamp-based applications, at large
scale, they always created shards per time (e.g. per day / hour)
  and when the taxonomy index is local to the shard, then it's gone
completely when the shard is gone.

* You can always re-map the ordinals to new ones by running a side process
which checks which of the categories are unused, adds
  those that are in use to a new taxonomy index and rewrites the payload
posing of the search index. It sounds expensive, but we've
  never had to do it yet, so I don't know how much expensive.

At the end of the day, the facets module lets you build the faceted search
that best suits your needs. It can work entirely off-disk,
it can be loaded in-memory (similar to FieldCache, Mike and I are working
on some improvements there - you're welcome to join!), it
can support exact counts or sampling, other aggregation methods than just
counting and many more.

The sidecar taxonomy index is not as bad as it sounds. As I've told you,
many IBM products are working with it for many years, at small
and large scale.

I think that Solr could benefit from this module too, and I hope that I
don't sound too biased :).
Having Solr reusing Lucene modules is important IMO.

Shai


On Wed, Dec 12, 2012 at 1:12 AM, Lukáš Vlček <[email protected]> wrote:

> Hi Shai,
>
> thanks for your blog, I am looking forward to your future posts!
>
> Just two questions: you mentioned that you have been running this in
> production in distributed mode. If I understand it correctly the idea is
> there is only a single taxonomy index even if the distributed mode means
> that the data indices were partitioned/sharded. (Thus the ordinals are
> global). The taxonomy index is not partitioned/sharded itself. Am I correct?
>
> Also what seems to be an interesting implication of this implementation is
> the fact that taxonomy index never cares about deleted documents
> (categories that are obsolete). In practices this is probably not a bit
> deal because the taxonomy index is small but I can imagine this might be
> problematic in some situations (for example imagine that the categories
> would be based on highly granular timestamp, that could create a lot of
> categories over short period of time and those would be kept "forever" and
> still growing...).
> (^^ I am just trying to understand how it works.)
>
> Regards,
> Lukas
>

Reply via email to