Even as a first step, it would be nice to have Lucene's faceting exposed to
Solr, even if only in a way that works with a single node.

Because it supports NRT, doesn't need to build up massive top-level
data structures, and so on, many people who currently need multiple
nodes might be able to work just fine with a single node.

On Wed, Dec 12, 2012 at 2:28 AM, Shai Erera <ser...@gmail.com> wrote:
> There are two ways you can work with the taxonomy index in a distributed
> environment (at least, these are the things that we've tried):
> (1) replicate the taxonomy to all shards, so that each shard sees the entire
> global taxonomy
> (2) each shard maintains its own taxonomy.
>
> (1) only makes sense when the shards are built by a side process, e.g.
> MapReduce, and then copied to their respective nodes.
> If you index like that, then your distributed faceted search (correcting the
> counts of categories) is done on ordinals rather than
> strings.
>
> (2) is the one that makes sense to most people, and is also NRT (where #1
> isn't!). Each shard maintains its local search +
> taxonomy indexes. In that mode, the counts correction cannot be done on
> ordinals, and has to be done on strings.
>
> When you're doing distributed faceted search, you cannot just ask each shard
> to return the top-10 categories for the "Author" dimension/facet,
> because then you might (1) miss a category that should have been in the
> total top-10 and (2) return a category in the top-10 with
> incorrect counts. What you can do is ask for C*10, where C is an
> over-counting factor. You'd still hope for the best though, b/c
> theoretically, you could still miss a category / have incorrect count for
> one.
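
A rough sketch of the over-request merge described above, for the per-shard
taxonomy case where the merge is done on strings. This is not Lucene or Solr
code: the function names (shard_top, merged_top_k) and the toy shard data are
made up for illustration, and each shard is modeled as a plain dict of
label -> count.

```python
import heapq
from collections import Counter


def shard_top(counts, n):
    """Return one shard's n highest-count (label, count) pairs."""
    return heapq.nlargest(n, counts.items(), key=lambda kv: (kv[1], kv[0]))


def merged_top_k(shards, k, c):
    """Ask every shard for its top c*k labels, sum the counts per label
    (a string merge), and keep the k labels with the highest totals."""
    total = Counter()
    for counts in shards:
        for label, count in shard_top(counts, c * k):
            total[label] += count
    return heapq.nlargest(k, total.items(), key=lambda kv: (kv[1], kv[0]))


# Hypothetical per-shard counts for one dimension. "C" is never the local
# winner, yet it has the highest global count (8 + 8 = 16).
SHARDS = [
    {"A": 10, "B": 9, "C": 8},
    {"D": 10, "E": 9, "C": 8},
]

no_overrequest = merged_top_k(SHARDS, k=1, c=1)  # misses "C" entirely
overrequest = merged_top_k(SHARDS, k=1, c=3)     # finds "C" with count 16
```

With c=1 each shard reports only its local winner, so the globally largest
category never reaches the merge. A larger c makes that less likely but, as
noted above, cannot rule it out in general.
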
>
> The difference between the two approaches is how big C can be. In the first
> approach, since all you transmit on the wire are
> integers, and the merge is done on integers, you can set C much higher than
> in the second approach. In practice though, since
> more and more applications are interested in real-time search, we keep a
> local taxonomy index per shard and do the merge
> on the strings.
>
> Also, when you're working at really large scale, exact counts for categories
> may not be that useful. How is Science Fiction (123,367,129)
> different from Drama (145,465,987)!? To the user these are just categories
> that are associated with more documents than I
> can digest anyway :).
>
> For that, we do sampling and display %, which is more consumable by users,
> and then you don't need to worry about exact counts.
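
One way to picture the sampling idea, as a sketch only: count categories over
a random sample of the matching documents and report each category's share of
the sample as a percentage. The function name, the sample_rate parameter and
the toy documents are assumptions for illustration; this is not how the
facets module implements sampling.

```python
import random
from collections import Counter


def sampled_percentages(doc_categories, sample_rate, seed=0):
    """Estimate, per category, the percentage of matching documents that
    carry it, looking only at a random sample of those documents."""
    rng = random.Random(seed)
    sample = [cats for cats in doc_categories if rng.random() < sample_rate]
    if not sample:
        return {}
    counts = Counter(cat for cats in sample for cat in cats)
    return {cat: 100.0 * n / len(sample) for cat, n in counts.items()}


# Hypothetical matching documents, each with its list of categories.
DOCS = [["Drama"], ["Drama"], ["Science Fiction"], ["Drama"]]

# sample_rate=1.0 keeps every document, so the result is exact here.
exact = sampled_percentages(DOCS, sample_rate=1.0)
```

At a lower sample_rate the percentages become estimates, which is usually
acceptable when the user only needs to see rough proportions.
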
>
> I think that I wrote a bit too long an answer to your question :).
>
> Regarding not deleting categories, we've thought about it in the past and
> I'm not sure it's a problem. I mean, in theory, yes, you could
> end up w/ a taxonomy index that has many unused categories. But:
>
> * Whenever we were dealing with timestamp-based applications, at large
> scale, they always created shards per time (e.g. per day / hour)
>   and when the taxonomy index is local to the shard, then it's gone
> completely when the shard is gone.
>
> * You can always re-map the ordinals to new ones by running a side process
> which checks which of the categories are unused, adds
>   those that are in use to a new taxonomy index and rewrites the payload
> postings of the search index. It sounds expensive, but we've
>   never had to do it yet, so I don't know how expensive it is.
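
The re-mapping side process in the second bullet could look roughly like the
sketch below. It assumes a flat taxonomy modeled as a list of labels where
the ordinal is the list index, and documents modeled as lists of ordinals;
it ignores category hierarchy and the actual payload format. The name
compact_taxonomy is hypothetical.

```python
def compact_taxonomy(labels, doc_ordinals):
    """Build a new taxonomy that holds only the categories still referenced
    by some document, and rewrite each document's ordinals to match it."""
    used = sorted({o for ords in doc_ordinals for o in ords})
    remap = {old: new for new, old in enumerate(used)}
    new_labels = [labels[old] for old in used]
    new_docs = [[remap[o] for o in ords] for ords in doc_ordinals]
    return new_labels, new_docs


# Hypothetical old taxonomy: two day categories are no longer referenced.
OLD_LABELS = ["2012-12-10", "2012-12-11", "2012-12-12", "Drama"]
OLD_DOCS = [[2, 3], [3]]

NEW_LABELS, NEW_DOCS = compact_taxonomy(OLD_LABELS, OLD_DOCS)
```

The cost is dominated by rewriting every document's ordinals, which is why
dropping a whole time-based shard (first bullet) is the cheaper option when
it applies.
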
>
> At the end of the day, the facets module lets you build the faceted search
> that best suits your needs. It can work entirely off-disk,
> it can be loaded in-memory (similar to FieldCache, Mike and I are working on
> some improvements there - you're welcome to join!), it
> can support exact counts or sampling, other aggregation methods than just
> counting and many more.
>
> The sidecar taxonomy index is not as bad as it sounds. As I've told you,
> many IBM products have been working with it for many years, at small
> and large scale.
>
> I think that Solr could benefit from this module too, and I hope that I
> don't sound too biased :).
> Having Solr reuse Lucene modules is important IMO.
>
> Shai
>
>
> On Wed, Dec 12, 2012 at 1:12 AM, Lukáš Vlček <lukas.vl...@gmail.com> wrote:
>>
>> Hi Shai,
>>
>> thanks for your blog, I am looking forward to your future posts!
>>
>> Just two questions: you mentioned that you have been running this in
>> production in distributed mode. If I understand it correctly the idea is
>> there is only a single taxonomy index even if the distributed mode means
>> that the data indices were partitioned/sharded. (Thus the ordinals are
>> global). The taxonomy index is not partitioned/sharded itself. Am I correct?
>>
>> Also what seems to be an interesting implication of this implementation is
>> the fact that taxonomy index never cares about deleted documents (categories
>> that are obsolete). In practice this is probably not a big deal because the
>> taxonomy index is small, but I can imagine this might be problematic in some
>> situations (for example, imagine that the categories were based on highly
>> granular timestamps; that could create a lot of categories over a short period
>> of time and those would be kept "forever" and still growing...).
>> (^^ I am just trying to understand how it works.)
>>
>> Regards,
>> Lukas
>
>
