Computing facets on the fly is interesting. Indeed, if you want to use the
taxonomy index, you have to plan for this in advance, say by adding each
term to the taxonomy under '/' and then asking to count '/'. If your index
is static, then not being able to delete from the taxonomy index won't be
a problem.
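
For illustration, the "plan ahead" part could look roughly like this (a
very rough, untested sketch; the "terms" root and the method names here
are made up, the indexing glue is omitted, and the facet package names
moved around between the 4.x releases):

  import java.io.IOException;
  import org.apache.lucene.facet.search.params.CountFacetRequest;
  import org.apache.lucene.facet.taxonomy.CategoryPath;
  import org.apache.lucene.facet.taxonomy.TaxonomyWriter;

  public class FlatTaxonomySketch {
    // indexing time: every term of the doc becomes a child of one flat root
    public static void indexTerms(TaxonomyWriter taxoWriter,
        Iterable<String> docTerms) throws IOException {
      for (String term : docTerms) {
        taxoWriter.addCategory(new CategoryPath("terms", term));
      }
    }

    // search time: ask to count the children of that flat root, top-k
    public static CountFacetRequest topTermsRequest(int k) {
      return new CountFacetRequest(new CategoryPath("terms"), k);
    }
  }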

But perhaps another approach would work too: you could reuse the code flow
of Lucene facets and feed it ordinals from another source. I don't know if
this is what UnInvertedField does, but if it is, then what you'd need to
write is a CategoryListIterator that fetches the ordinals from the cache,
rather than from the payload (or from DocValues, see our work on
LUCENE-4602). You'd then get for free (I think?) the top-K computation,
sampling (not critical in a 300K docs index) and even complement counting.
You'd also need to write a FacetResultsHandler, or extend the existing
top-K one, in order to label the top ordinals from the other source.
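
To make that concrete, here is a toy, self-contained sketch of the
"ordinals from a cache" piece. It deliberately does not implement the real
CategoryListIterator interface (its signature differs between releases);
it only shows the shape of the component: given a docID, hand the
aggregator that document's ordinals out of an in-memory, uninverted
structure, instead of decoding them from the payload / DV:

  import java.util.HashMap;
  import java.util.Map;

  /** Toy stand-in for an uninverted docID -> ordinals cache. */
  public class CachedOrdinalsSource {
    private static final int[] EMPTY = new int[0];
    private final Map<Integer, int[]> cache = new HashMap<Integer, int[]>();

    public void put(int docID, int[] ordinals) {
      cache.put(docID, ordinals);
    }

    /** What a cache-backed CategoryListIterator would hand out per doc. */
    public int[] ordinals(int docID) {
      int[] ords = cache.get(docID);
      return ords == null ? EMPTY : ords;
    }

    /** The counting loop the aggregator effectively runs over the hits. */
    public int[] count(int[] matchingDocs, int maxOrdinal) {
      int[] counts = new int[maxOrdinal + 1];
      for (int docID : matchingDocs) {
        for (int ord : ordinals(docID)) {
          counts[ord]++;
        }
      }
      return counts;
    }
  }

Labeling the top ordinals afterwards is exactly the part the
FacetResultsHandler would take care of.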

At least from the software perspective, I think this should work, but I
haven't tried it!

As for the numbers, 3.1M nodes in the taxonomy (it'd be a flat taxonomy,
all under '/' I guess) should scale. But in the general case, where
indexes can have millions of terms (even hundreds of millions), off the
top of my head I don't see how it can scale. Maybe if you limit it to a
certain field with low cardinality ... or maybe if you store term vectors
and limit the faceting to the vectors of the docs in the result set. But I
need to think about it some more, as neither of these options seems great
to me right now.
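
The term-vectors variant would boil down to something like this sketch
(again untested; getTermVector() and the 4.x TermsEnum iteration are real
APIs, but the field name and the way you obtain the hits are assumptions):

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Terms;
  import org.apache.lucene.index.TermsEnum;
  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.util.BytesRef;

  public class ResultSetTermCounts {
    /** Count terms only over the result set, via the docs' term vectors. */
    public static Map<String, Integer> count(IndexReader reader,
        ScoreDoc[] hits, String field) throws IOException {
      Map<String, Integer> counts = new HashMap<String, Integer>();
      for (ScoreDoc hit : hits) {
        Terms vector = reader.getTermVector(hit.doc, field);
        if (vector == null) {
          continue; // doc has no term vector for this field
        }
        TermsEnum te = vector.iterator(null); // 4.x signature; later: iterator()
        BytesRef term;
        while ((term = te.next()) != null) {
          String t = term.utf8ToString();
          Integer c = counts.get(t);
          counts.put(t, c == null ? 1 : c + 1);
        }
      }
      return counts; // pick the top-K entries with a priority queue
    }
  }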

Shai


On Fri, Dec 14, 2012 at 12:31 AM, Smiley, David W. <dsmi...@mitre.org> wrote:

>  I second this use-case. This is my only concern with Solr faceting:
> using Solr's UnInvertedField on the search index to discover frequently
> used words doesn't scale well. Shai, do you think this would scale? FWIW,
> one of my indexes with only 300k docs has ~3.1M terms; not a lot, but it's
> a number to consider.
>
>  ~ David
>
>   From: "Adrien Grand [via Lucene]" <
> ml-node+s472066n4026847...@n3.nabble.com>
>
>  Hi Shai,
>
> Thanks for your answers!
>
>  …
>
>
> > So I think that if anyone would want to really manage taxonomies of that
> > size, we'd need to discuss and maybe get back to the drawing board :).
>
> One use-case I'm thinking of is finding the top terms of documents
> that match an arbitrary query. This can be very useful to help you
> better understand your data, but in this case the number of distinct
> values is the size of your term dictionary.
>
> --
> Adrien
>
>
