Hi Adrien,
> the lucene module requires users to decide at indexing time what and how
> to facet whereas Solr does everything at searching time
True, that's one difference between the two implementations today, even
though I think that we can create a specialized path (under LUCENE-4619)
for really simple, non-hierarchical cases.
I don't know if and how Solr can handle a field value
Sport/Basketball/NBA/... -- i.e., how is the hierarchy broken?
I imagine that there's no magic done here. Assuming that Solr can handle it
(and I think I read somewhere that it does handle hierarchical facets?),
you've got to specify somewhere that this field's values should be broken
on '/' and that you'd like to facet on it? Or at least you need to say
"create me a hierarchy from it"?
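To make the question concrete, here is a minimal sketch of what "breaking the hierarchy" on a delimiter could look like. It is not taken from Solr or from the Lucene facet module; the class and method names are made up for illustration, and the only point is that the delimiter has to be specified by someone, somewhere.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Solr or Lucene.
final class PathPrefixes {
    /** Returns every ancestor path of a delimited value, from the root down to the full value. */
    static List<String> expand(String value, char delimiter) {
        List<String> prefixes = new ArrayList<>();
        int from = 0;
        while (true) {
            int idx = value.indexOf(delimiter, from);
            if (idx < 0) {
                // No more delimiters: the full value is the deepest node.
                prefixes.add(value);
                return prefixes;
            }
            prefixes.add(value.substring(0, idx));
            from = idx + 1;
        }
    }

    public static void main(String[] args) {
        // Prints [Sport, Sport/Basketball, Sport/Basketball/NBA]
        System.out.println(expand("Sport/Basketball/NBA", '/'));
    }
}
```

Each returned prefix is one node a document would be counted under, which is why the field needs to be declared hierarchical (and with which delimiter) before counting can happen.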
But I think that in Lucene we can add a FlatFacetsField, so that you
initialize it like new FlatFacetsField("Author", "Shai") and it will create
the implicit hierarchy Author/Shai.
Or, we can add a FieldType.facet(), and if the field is a StringField (i.e.
indexed, not tokenized), then we create the implicit hierarchy fName/fValue?
Just throwing out an idea... that's basically the purpose of LUCENE-4619: come
up w/ an even simpler starter-level API for really simple cases.
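As a rough sketch of the FlatFacetsField idea: the class below does not exist in Lucene and does not extend any Lucene type. It only models the proposed behaviour, where a (field name, value) pair turns into an implicit two-level category path. The method name categoryPath is an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical: models the proposal only, with no Lucene dependency.
final class FlatFacetsField {
    private final String dimension;
    private final String value;

    FlatFacetsField(String dimension, String value) {
        if (dimension.isEmpty() || value.isEmpty()) {
            throw new IllegalArgumentException("dimension and value must be non-empty");
        }
        this.dimension = dimension;
        this.value = value;
    }

    /** The implicit hierarchy as path components, e.g. [Author, Shai]. */
    List<String> categoryPath() {
        return Arrays.asList(dimension, value);
    }

    @Override
    public String toString() {
        return dimension + "/" + value;
    }

    public static void main(String[] args) {
        // Prints Author/Shai
        System.out.println(new FlatFacetsField("Author", "Shai"));
    }
}
```

The FieldType.facet() variant would derive the same path from the field's own name and value instead of taking them as constructor arguments.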
As for deciding at search time that you'd like to facet on a field ...
well, I think that not doing that is what allows us to do efficient faceted
search, off-disk or in-memory, support really large indexes and taxonomies,
and be NRT.
From the little I know and read, this is one drawback of Solr facets? But
if not, don't be too harsh in your reply, I'm not trying to pass any
judgement here :).
> - do you have any rough idea of how speed and memory usage vary
> depending on the number of docs to collect, distinct field values,
> etc. ?
As the tests show (I think on LUCENE-4602, but I'm starting to lose track
of all the new issues :)), when you load the facets info into memory,
performance improves. Still, I think that if you're going to count facets
on millions of documents, it's not going to be efficient, no matter where
they are. Loading them into memory will speed things up, of course, but will
also consume more RAM.
That's why we can sample facets: you get the approximate top-K very fast, and
then, per your decision, you can do a 2nd pass to correct the approximate
weights, or return them as is, e.g. as percentages.
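The two-pass idea can be sketched roughly as follows. This is not the facet module's sampling code: the data model (one category ordinal per matching document, held in an int[]) and all names are simplifying assumptions, chosen only to show sample, pick top-K, then optionally correct.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Hypothetical sketch; docCategory[i] is the category ordinal of matching doc i.
final class SampledFacetCounts {
    /** First pass: count categories over a random sample of the matching docs. */
    static int[] sampleCounts(int[] docCategory, int numCategories, double sampleRatio, long seed) {
        Random random = new Random(seed);
        int[] counts = new int[numCategories];
        for (int category : docCategory) {
            if (random.nextDouble() < sampleRatio) {
                counts[category]++;
            }
        }
        return counts;
    }

    /** Ordinals of the k largest counts, highest first (ties keep ordinal order). */
    static List<Integer> topK(int[] counts, int k) {
        List<Integer> ordinals = new ArrayList<>();
        for (int i = 0; i < counts.length; i++) {
            ordinals.add(i);
        }
        ordinals.sort(Comparator.comparingInt((Integer o) -> counts[o]).reversed());
        return new ArrayList<>(ordinals.subList(0, Math.min(k, ordinals.size())));
    }

    /** Optional second pass: exact counts over all docs, but only for the candidates. */
    static int[] exactCounts(int[] docCategory, List<Integer> candidates) {
        int[] exact = new int[candidates.size()];
        for (int category : docCategory) {
            int pos = candidates.indexOf(category);
            if (pos >= 0) {
                exact[pos]++;
            }
        }
        return exact;
    }
}
```

Skipping the second pass means returning the sampled counts as they are, in which case they are best read as relative weights rather than absolute numbers.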
> TaxonomyReader seems to use ints as ordinals for category paths,
> does it mean that the faceting module can't handle paths that have
> more than 2B distinct values? Is it fixable? (Or maybe it doesn't make
> sense to handle such large numbers of distinct values?)
That's right, it's a limitation, but I haven't seen a taxonomy that is that
big. I've worked w/ several teams which had really huge taxonomies, I'm talking
on the order of 10M nodes, but that doesn't even scratch the MAX_INT limit,
right?
I guess that we can change the taxonomy to support long ordinals, but I
think that managing a taxonomy that size is going to pose plenty of other
limitations first. Probably much sooner than you'd hit the MAX_INT limit :).
I.e., today we count the facets in memory, which is one contiguous array of
integers. If it's too large, you can choose to partition the ordinal space
into smaller sets.
But even if a partition is of size 1M, or 10M, I don't think that counting
200+ partitions makes sense (compared to e.g. reading 200 posting lists).
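A minimal sketch of partitioned counting, to show where the cost comes from. This is not the facet module's implementation: the names and the flat int[] of per-document ordinals are assumptions. The point is that only one partition-sized array is live at a time, but every partition means another pass over the matching ordinals.

```java
import java.util.Arrays;

// Hypothetical sketch of counting over a partitioned ordinal space.
final class PartitionedCounter {
    /** Number of partitions needed to cover the ordinal space (ceiling division). */
    static int numPartitions(int taxonomySize, int partitionSize) {
        return (int) (((long) taxonomySize + partitionSize - 1) / partitionSize);
    }

    /** Returns the ordinal with the highest count, or -1 if there are no ordinals. */
    static int mostFrequentOrdinal(int[] docOrdinals, int taxonomySize, int partitionSize) {
        int bestOrdinal = -1;
        int bestCount = 0;
        int[] counts = new int[partitionSize]; // reused for every partition
        // long loop variable so that start + partitionSize cannot overflow
        for (long start = 0; start < taxonomySize; start += partitionSize) {
            long end = Math.min(start + partitionSize, (long) taxonomySize);
            Arrays.fill(counts, 0);
            // Each partition re-scans all ordinals: this is the repeated cost.
            for (int ordinal : docOrdinals) {
                if (ordinal >= start && ordinal < end) {
                    counts[(int) (ordinal - start)]++;
                }
            }
            for (int i = 0; i < end - start; i++) {
                if (counts[i] > bestCount) {
                    bestCount = counts[i];
                    bestOrdinal = (int) (start + i);
                }
            }
        }
        return bestOrdinal;
    }
}
```

With these assumptions, numPartitions(Integer.MAX_VALUE, 10_000_000) is 215, so a full int ordinal space at 10M ordinals per partition would mean over 200 passes per query.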
So I think that if anyone really wants to manage taxonomies of that
size, we'd need to discuss and maybe go back to the drawing board :).
Shai
On Thu, Dec 13, 2012 at 2:03 PM, Adrien Grand <[email protected]> wrote:
> Hi Shai,
>
> On Thu, Dec 13, 2012 at 12:21 PM, Shai Erera <[email protected]> wrote:
> > As I said, if someone volunteers to do some work on the Solr side, I will
> > gladly participate in that effort.
> > I just don't even know where to start w/ Solr :).
>
> The entry point for Solr facets is
> org.apache.solr.request.SimpleFacets.getFacetCounts (called from
> FacetComponent).
>
> > One thing that would be really great is if we can build an adapter (I
> > think someone mentioned that word here) which supports basic facets
> > capabilities, so that we can at least benchmark Solr's current
> > implementation vs the implementation w/ the module.
>
> Comparing both impls would be great but an adapter might be hard to
> write given how Lucene faceting differs from Solr faceting: the lucene
> module requires users to decide at indexing time what and how to facet
> whereas Solr does everything at searching time (there is even an issue
> open in order to be able to compute facet counts based on arbitrary
> functions [1]) using FieldCache and UninvertedField (meaning that you
> can compute facets on any field that is indexed). So Lucene faceting
> would probably require an additional field property in the schema to
> let Solr know that it should add category paths to documents? (Please
> correct me if anything I wrote here is wrong).
>
> I have a few questions regarding the faceting module:
> - do you have any rough idea of how speed and memory usage vary
> depending on the number of docs to collect, distinct field values,
> etc. ?
> - TaxonomyReader seems to use ints as ordinals for category paths,
> does it mean that the faceting module can't handle paths that have
> more than 2B distinct values? Is it fixable? (Or maybe it doesn't make
> sense to handle such large numbers of distinct values?)
>
> [1] https://issues.apache.org/jira/browse/SOLR-1581
>
> --
> Adrien