Re: Adding a new PointDocValuesField

Marc D'Mello Wed, 25 May 2022 10:05:24 -0700

>
> But adding a new type should be the last resort.


I did not realize that was the case; that's good to know. It seems like I
should just use BinaryDocValues (BDV), which also makes the code change
easier and faster, so I have no issues with it.
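
For illustration, here is a minimal sketch of what packing points on top of
BDV could look like. It is only one possible encoding, not something agreed
on in this thread: the class name PackedPoints and the layout (point count,
dimension count, then the raw long values) are assumptions made up for this
example, and it uses only java.nio so it does not depend on any Lucene API.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Lucene: packs N-dimensional long points
// into one byte[] that could be stored as a single binary doc value.
final class PackedPoints {
  private PackedPoints() {}

  // Assumed layout: [int numPoints][int numDims][long values, point by point]
  static byte[] encode(long[][] points, int numDims) {
    ByteBuffer buf =
        ByteBuffer.allocate(2 * Integer.BYTES + points.length * numDims * Long.BYTES);
    buf.putInt(points.length);
    buf.putInt(numDims);
    for (long[] point : points) {
      if (point.length != numDims) {
        throw new IllegalArgumentException(
            "expected " + numDims + " dims, got " + point.length);
      }
      for (long v : point) {
        buf.putLong(v);
      }
    }
    return buf.array();
  }

  // Reverses encode(): reads the two header ints, then the values in order.
  static long[][] decode(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    int numPoints = buf.getInt();
    int numDims = buf.getInt();
    long[][] points = new long[numPoints][numDims];
    for (int i = 0; i < numPoints; i++) {
      for (int d = 0; d < numDims; d++) {
        points[i][d] = buf.getLong();
      }
    }
    return points;
  }
}
```

The resulting byte[] would presumably be wrapped in a BytesRef and indexed
with a BinaryDocValuesField; a fixed-width layout like this one is simple
but makes no attempt at compression.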

As for Patrick's suggestion of using separate numeric fields instead of
packing them together, that does sound like an interesting idea. I think
the biggest issue with it would be implementing a multivalued version. As
Robert pointed out, we would need an UnsortedNumericDV.
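
To make the alignment problem concrete, here is a small sketch in plain
Java. It assumes a document with two (year, make) points and compares
reading the two fields in insertion order with reading them after each
field's values have been sorted per document, which is what
SortedNumericDocValues does. AlignmentDemo and its methods are hypothetical
names for this example, not Lucene APIs.

```java
import java.util.Arrays;

// Hypothetical demo class, not a Lucene API.
final class AlignmentDemo {
  private AlignmentDemo() {}

  // Pairs values by position: (a[0],b[0]) (a[1],b[1]) ...
  static String pairs(long[] a, long[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("fields must have the same value count");
    }
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < a.length; i++) {
      if (i > 0) {
        sb.append(' ');
      }
      sb.append('(').append(a[i]).append(',').append(b[i]).append(')');
    }
    return sb.toString();
  }

  // Mimics the per-document ordering of a sorted multi-valued numeric field.
  static long[] sortedCopy(long[] values) {
    long[] copy = values.clone();
    Arrays.sort(copy);
    return copy;
  }
}
```

With years {2015, 2010} and make ids {3, 7}, pairing in insertion order
gives back the indexed combinations (2015,3) and (2010,7). Pairing the
sorted copies gives (2010,3) and (2015,7), neither of which was indexed,
which is why the separate-field approach needs an unsorted multivalued
type.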

Thanks for all the feedback!


On Wed, May 25, 2022 at 8:17 AM Robert Muir <rcm...@gmail.com> wrote:

> On Wed, May 25, 2022 at 12:17 AM Greg Miller <gsmil...@gmail.com> wrote:
> >
> > A "two separate field approach" would
> > consist of indexing year and make separately, and you'd lose the
> > information that only certain combinations are valid. Am I overlooking
> > something with your suggestion? Maybe there's something we can do with
> > Lucene already that solves for this case and I'm just not aware of it?
> > That's entirely possible and I'd love to learn more if there is!
>
> This makes no sense to me. If there are two dimensions, there's no
> difference in faceting code calling fieldA.value and fieldB.value,
> than calling field.valueA and field.valueB.
>
> In other words, doesn't make any sense to needlessly "pack dimensions
> together" at docvalues level, especially for what should be a
> column-stride field. There's really no difference from the app
> perspective. Any issues you have here seem to be issues around facet
> module and not docvalues...
>
> >
> > As for MultiRangeQuery and the mention of sandbox modules, I think
> > that's a bit of a different use-case. MultiRangeQuery lets you filter
> > by a disjunction of ranges. The "multi" part doesn't relate to
> > "multiple values in a doc" (but it does support that, as do the
> > "standard" range queries).
> >
> > Where I see a gap right now, beyond just faceting, is that we can
> > represent N-dim points in the points index and filter on them (using
> > the points index), but we have no doc values equivalent. This means,
> > 1) we can't facet, and 2) we can't create a "slow" query that does
> > post-filtering instead of using the points index (which could be a
> > very real advantage in cases with a sparse match set but a dense
> > points index). So I like the idea of creating that concept and being
> > able to facet and filter on it. Whether or not this is a "formal" doc
> > values type or sits on top of BDV, I have less of a strong opinion.
>
> We shouldn't add new docvalues types because of "slow queries", I'm
> really against that. The root problem is that points impl can't filter
> well (like the inverted index can), and as a hack, docvalues "picks up
> the slack". If it's becoming a major issue, address this with points
> directly?
>
> >
> > And finally... it really should be multi-valued. The points index
> > supports multiple points-per-field within a single document. Seems
> > like a big gap that we wouldn't support that with a doc value field.
> > Because BDV is inherently single-valued, I propose we come up with an
> > encoding scheme that encodes multiple points on top of that "single"
> > BDV entry. This is where building on BDV started to feel a little icky
> > to me and it seemed like it might be a good use-case for actually
> > formalizing a format/encoding, but again, no strong preference. We
> > could certainly do something more quickly on top of BDV and formalize
> > an encoding later if/as necessary.
>
> Doesn't matter that points index supports it. Do the use-cases make
> sense? It's especially stupid that e.g. LatLonDocValuesField supports
> multi-values. Really? What kind of quantum documents are in multiple
> locations at the same time?
>
> The sortedset/sortednumeric exist to support use-cases on String and
> int, where user wants to "sort on a multivalued field", which is
> really crazy if you think about it. So they both sort the numbers at
> index-time, so that you can pick a "representative" value
> (min/max/median) in constant time. I think a lot of this existing
> stuff is just brain-damage from the no-sql fads, alternatively we
> could remove this multivalued nonsense and the crazy servers that want
> to follow no-sql fads could index just the "representative value"
> (min/max/median) in a single-valued field.
>
> Sorry, I'm just not seeing a lot of strong use-cases here to justify
> creating a new DV field, which we should really avoid, as it's a hugely
> expensive cost. I would recommend prototyping stuff with
> BinaryDocValues, using the sandbox, etc. See if the features get
> popular and people use them.
>
> If they really "catch on", and we think it's more efficient, then we
> can think about how the stuff could be best encoded/compressed/etc.
> But adding a new type should be the last resort. Adding some
> specialized multi-dimensional type is IMO out of the question. It
> would be a lot less horrible to just use separate DV fields, one for
> each dimension. If there are *strong* compelling use-cases for
> multi-valued stuff, then in the worst case we could think about
> something like an UnsortedNumericDV, which would allow fieldA[0] to
> align with fieldB[0] and fieldA[1] to align with fieldB[1], which
> would solve the issue for faceting. Just don't allow sorting. And
> probably not any "slow" query stuff too.
>