Re: Adding a new PointDocValuesField

Robert Muir Wed, 25 May 2022 08:17:15 -0700

On Wed, May 25, 2022 at 12:17 AM Greg Miller <[email protected]> wrote:
>
>  A "two separate field approach" would
> consist of indexing year and make separately, and you'd lose the
> information that only certain combinations are valid. Am I overlooking
> something with your suggestion? Maybe there's something we can do with
> Lucene already that solves for this case and I'm just not aware of it?
> That's entirely possible and I'd love to learn more if there is!


This makes no sense to me. If there are two dimensions, there's no
difference between faceting code calling fieldA.value and fieldB.value
and faceting code calling field.valueA and field.valueB.

In other words, it doesn't make any sense to needlessly "pack dimensions
together" at the docvalues level, especially for what should be a
column-stride field. There's really no difference from the app
perspective. Any issues you have here seem to be issues around the facet
module and not docvalues...
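
To illustrate the point for the single-valued case, here is a minimal
sketch in plain Java. It does not use any Lucene API: the arrays are
hypothetical stand-ins for per-document column values, and the class
and method names are made up for the example. It only shows that
counting (year, make) combinations gives the same result whether the
two dimensions live in separate columns or in one packed column.

```java
import java.util.HashMap;
import java.util.Map;

final class TwoColumnFacetSketch {
    // Hypothetical columns, indexed by docID: one value per document per field.
    static final long[] YEAR = {2019, 2020, 2020, 2021, 2020};
    static final long[] MAKE = {1, 1, 2, 1, 1};

    // The same data "packed" into one column of (year, make) pairs.
    static final long[][] PACKED = {
        {2019, 1}, {2020, 1}, {2020, 2}, {2021, 1}, {2020, 1}
    };

    // Faceting code reading fieldA.value and fieldB.value for each doc.
    static Map<String, Integer> countFromTwoFields() {
        Map<String, Integer> counts = new HashMap<>();
        for (int doc = 0; doc < YEAR.length; doc++) {
            counts.merge(YEAR[doc] + "/" + MAKE[doc], 1, Integer::sum);
        }
        return counts;
    }

    // Faceting code reading field.valueA and field.valueB for each doc.
    static Map<String, Integer> countFromPackedField() {
        Map<String, Integer> counts = new HashMap<>();
        for (int doc = 0; doc < PACKED.length; doc++) {
            counts.merge(PACKED[doc][0] + "/" + PACKED[doc][1], 1, Integer::sum);
        }
        return counts;
    }
}
```

Both methods visit each document once and build the same combination
key, so the valid-combination information is preserved in either
layout as long as each field holds one value per document.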

>
> As for MultiRangeQuery and the mention of sandbox modules, I think
> that's a bit of a different use-case. MultiRangeQuery lets you filter
> by a disjunction of ranges. The "multi" part doesn't relate to
> "multiple values in a doc" (but it does support that, as do the
> "standard" range queries).
>
> Where I see a gap right now, beyond just faceting, is that we can
> represent N-dim points in the points index and filter on them (using
> the points index), but we have no doc values equivalent. This means,
> 1) we can't facet, and 2) we can't create a "slow" query that does
> post-filtering instead of using the points index (which could be a
> very real advantage in cases with a sparse match set but a dense
> points index). So I like the idea of creating that concept and being
> able to facet and filter on it. Whether-or-not this is a "formal" doc
> values type or sits on top of BDV, I have less of a strong opinion.

We shouldn't add new docvalues types because of "slow queries", I'm
really against that. The root problem is that the points impl can't filter
well (like the inverted index can), and as a hack, docvalues "picks up
the slack". If it's becoming a major issue, address this with points
directly?

>
> And finally... it really should be multi-valued. The points index
> supports multiple points-per-field within a single document. Seems
> like a big gap that we wouldn't support that with a doc value field.
> Because BDV is inherently single-valued, I propose we come up with an
> encoding scheme that encodes multiple points on top of that "single"
> BDV entry. This is where building on BDV started to feel a little icky
> to me and it seemed like it might be a good use-case for actually
> formalizing a format/encoding, but again, no strong preference. We
> could certainly do something more quickly on top of BDV and formalize
> an encoding later if/as necessary.

Doesn't matter that points index supports it. Do the use-cases make
sense? It's especially stupid that e.g. LatLonDocValuesField supports
multi-values. Really? What kind of quantum documents are in multiple
locations at the same time?

The sortedset/sortednumeric exist to support use-cases on String and
int, where user wants to "sort on a multivalued field", which is
really crazy if you think about it. So they both sort the numbers at
index-time, so that you can pick a "representative" value
(min/max/median) in constant time. I think a lot of this existing
stuff is just brain-damage from the no-sql fads. Alternatively, we
could remove this multivalued nonsense and the crazy servers that want
to follow no-sql fads could index just the "representative value"
(min/max/median) in a single-valued field.
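
As a rough sketch of why sorting at index-time makes the
"representative" value cheap, consider the plain-Java example below.
It is not Lucene code: the class and method names are hypothetical,
and the median is taken as the lower-middle element, which is an
assumption of this example.

```java
import java.util.Arrays;

final class RepresentativeValueSketch {
    // Done once per document at index time: store the values in sorted order.
    static long[] sortAtIndexTime(long... values) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        return sorted;
    }

    // At search time each selector is a single array access, no scan needed.
    static long min(long[] sorted) {
        return sorted[0];
    }

    static long max(long[] sorted) {
        return sorted[sorted.length - 1];
    }

    // Lower-middle element when the count is even (an assumption of this sketch).
    static long median(long[] sorted) {
        return sorted[(sorted.length - 1) >>> 1];
    }
}
```

Indexing only the representative value in a single-valued field, as
suggested above, amounts to running one of these selectors before
indexing instead of at search time.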

Sorry, I'm just not seeing a lot of strong use-cases here to justify
creating a new DV field, which we should really avoid, as it's a hugely
expensive cost. I would recommend prototyping stuff with
BinaryDocValues, using the sandbox, etc. See if the features get
popular and people use them.
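
For a sense of what such a prototype might look like, here is a
minimal sketch of packing several 2-dimensional points into one
byte[] that could then be stored as a single binary value per
document. The layout (a count followed by fixed-width longs), the
class name, and the fixed dimension count are all assumptions of this
example, not an existing Lucene encoding.

```java
import java.nio.ByteBuffer;

final class PackedPointsSketch {
    // Number of dimensions per point; fixed at 2 for this sketch.
    static final int DIMS = 2;

    // Layout: [int count][point 0: DIMS longs][point 1: DIMS longs]...
    static byte[] encode(long[][] points) {
        ByteBuffer buf =
            ByteBuffer.allocate(Integer.BYTES + points.length * DIMS * Long.BYTES);
        buf.putInt(points.length);
        for (long[] point : points) {
            if (point.length != DIMS) {
                throw new IllegalArgumentException("expected " + DIMS + " dimensions");
            }
            for (long value : point) {
                buf.putLong(value);
            }
        }
        return buf.array();
    }

    // Reverses encode(): reads the count, then DIMS longs per point.
    static long[][] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int count = buf.getInt();
        long[][] points = new long[count][DIMS];
        for (int i = 0; i < count; i++) {
            for (int d = 0; d < DIMS; d++) {
                points[i][d] = buf.getLong();
            }
        }
        return points;
    }
}
```

Fixed-width longs keep the sketch simple. Whether a more compact
encoding is worth it is the kind of question that can wait until the
feature has proven itself.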

If they really "catch on", and we think it's more efficient, then we
can think about how the stuff could be best encoded/compressed/etc.
But adding a new type should be the last resort. Adding some
specialized multi-dimensional type is IMO out of the question. It
would be a lot less horrible to just use separate DV fields, one for
each dimension. If there are *strong*, compelling use-cases for
multi-valued stuff, then in the worst case we could think about
something like an UnsortedNumericDV, which would allow fieldA[0] to
align with fieldB[0] and fieldA[1] to align with fieldB[1], which
would solve the issue for faceting. Just don't allow sorting. And
probably not any "slow" query stuff either.
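
To make the alignment point concrete, here is a small plain-Java
sketch. UnsortedNumericDV does not exist; the arrays below are
hypothetical stand-ins for the per-document values of two fields, and
the helper names are made up. The sketch shows that keeping values in
insertion order preserves the (fieldA[i], fieldB[i]) pairs, while
sorting each field independently can scramble them.

```java
import java.util.Arrays;

final class AlignedValuesSketch {
    // Pairs up a[i] with b[i]; both arrays must have the same length.
    static long[][] zip(long[] a, long[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("fields must have the same value count");
        }
        long[][] pairs = new long[a.length][2];
        for (int i = 0; i < a.length; i++) {
            pairs[i][0] = a[i];
            pairs[i][1] = b[i];
        }
        return pairs;
    }

    // What a per-field sort does to the stored values (returns a sorted copy).
    static long[] sorted(long[] values) {
        long[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }
}
```

For example, a document indexed with the pairs (2020, 5) and
(2018, 9) keeps those pairs when the values are zipped in insertion
order. If each field is sorted on its own, zipping yields (2018, 5)
and (2020, 9), which were never indexed.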
