On Wed, May 25, 2022 at 12:17 AM Greg Miller <gsmil...@gmail.com> wrote:
>
> A "two separate field approach" would
> consist of indexing year and make separately, and you'd lose the
> information that only certain combinations are valid. Am I overlooking
> something with your suggestion? Maybe there's something we can do with
> Lucene already that solves for this case and I'm just not aware of it?
> That's entirely possible and I'd love to learn more if there is!

This makes no sense to me. If there are two dimensions, there's no
difference between faceting code calling fieldA.value and fieldB.value
and calling field.valueA and field.valueB. In other words, it doesn't
make any sense to needlessly "pack dimensions together" at the docvalues
level, especially for what should be a column-stride field. There's
really no difference from the app perspective. Any issues you have here
seem to be issues with the facet module and not docvalues...

>
> As for MultiRangeQuery and the mention of sandbox modules, I think
> that's a bit of a different use-case. MultiRangeQuery lets you filter
> by a disjunction of ranges. The "multi" part doesn't relate to
> "multiple values in a doc" (but it does support that, as do the
> "standard" range queries).
>
> Where I see a gap right now, beyond just faceting, is that we can
> represent N-dim points in the points index and filter on them (using
> the points index), but we have no doc values equivalent. This means,
> 1) we can't facet, and 2) we can't create a "slow" query that does
> post-filtering instead of using the points index (which could be a
> very real advantage in cases with a sparse match set but a dense
> points index). So I like the idea of creating that concept and being
> able to facet and filter on it. Whether-or-not this is a "formal" doc
> values type or sits on top of BDV, I have less of a strong opinion.

We shouldn't add new docvalues types because of "slow queries"; I'm
really against that. The root problem is that the points impl can't
filter well (like the inverted index can), and as a hack, docvalues
"picks up the slack". If it's becoming a major issue, why not address
this in points directly?

>
> And finally... it really should be multi-valued. The points index
> supports multiple points-per-field within a single document. Seems
> like a big gap that we wouldn't support that with a doc value field.
> Because BDV is inherently single-valued, I propose we come up with an
> encoding scheme that encodes multiple points on top of that "single"
> BDV entry. This is where building on BDV started to feel a little icky
> to me and it seemed like it might be a good use-case for actually
> formalizing a format/encoding, but again, no strong preference. We
> could certainly do something more quickly on top of BDV and formalize
> an encoding later if/as necessary.

It doesn't matter that the points index supports it. Do the use-cases
make sense? It's especially stupid that e.g. LatLonDocValuesField
supports multi-values. Really? What kind of quantum documents are in
multiple locations at the same time?

The sortedset/sortednumeric types exist to support use-cases on String
and int, where the user wants to "sort on a multivalued field", which is
really crazy if you think about it. So they both sort the values at
index-time, so that you can pick a "representative" value
(min/max/median) in constant time. I think a lot of this existing stuff
is just brain-damage from the no-sql fads. Alternatively, we could
remove this multivalued nonsense, and the crazy servers that want to
follow no-sql fads could index just the "representative value"
(min/max/median) in a single-valued field.

Sorry, I'm just not seeing a lot of strong use-cases here to justify
creating a new DV field, which we should really avoid, as it comes at a
huge cost. I would recommend prototyping stuff with BinaryDocValues,
using the sandbox, etc. See if the features get popular and people use
them. If they really "catch on", and we think it's more efficient, then
we can think about how the stuff could be best
encoded/compressed/etc. But adding a new type should be the last
resort.

Adding some specialized multi-dimensional type is IMO out of the
question. It would be a lot less horrible to just use separate DV
fields, one for each dimension.

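To make the "representative value" point concrete, here is a minimal sketch in plain Java. It does not use any Lucene API; the class and method names are hypothetical, and taking the upper-middle element as the "median" for even counts is an assumption made only for illustration. Once a document's values are sorted at index time, each pick is a single array access:

```java
import java.util.Arrays;

// Hypothetical illustration only: plain Java, not a Lucene API.
final class RepresentativeValue {

    // Done once per document at index time: keep the values in sorted order.
    static long[] sortAtIndexTime(long[] values) {
        long[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }

    // With sorted values, each representative is one array access.
    static long min(long[] sorted) {
        return sorted[0];
    }

    static long max(long[] sorted) {
        return sorted[sorted.length - 1];
    }

    // Assumption: the upper-middle element stands in for the median on even counts.
    static long median(long[] sorted) {
        return sorted[sorted.length >>> 1];
    }
}
```

Under the alternative mentioned above, an application could compute one of these values itself and index only that value in a single-valued field.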
If there are *strong*, compelling use-cases for multi-valued stuff, then
in the worst case we could think about something like an
UnsortedNumericDV, which would allow fieldA[0] to align with fieldB[0]
and fieldA[1] to align with fieldB[1], which would solve the issue for
faceting. Just don't allow sorting. And probably not any "slow" query
stuff either.
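
A rough sketch of what that alignment buys for faceting, in plain Java with hypothetical names (UnsortedNumericDV does not exist in Lucene, and nothing here is a Lucene API). Each document keeps its values in insertion order in two parallel arrays, so index i in one field pairs with index i in the other and only combinations that were actually indexed get counted:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: plain Java, not a Lucene API.
final class AlignedPairFacets {

    // fieldA[doc][i] pairs with fieldB[doc][i]; values are kept in insertion
    // order (unsorted), which is what preserves the pairing.
    static Map<String, Integer> countPairs(long[][] fieldA, long[][] fieldB) {
        if (fieldA.length != fieldB.length) {
            throw new IllegalArgumentException("fields must cover the same documents");
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int doc = 0; doc < fieldA.length; doc++) {
            long[] a = fieldA[doc];
            long[] b = fieldB[doc];
            if (a.length != b.length) {
                throw new IllegalArgumentException("misaligned values in doc " + doc);
            }
            for (int i = 0; i < a.length; i++) {
                // Count each (a, b) occurrence under a simple "a/b" key.
                counts.merge(a[i] + "/" + b[i], 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

For example, a document with years {2010, 2012} and make ids {7, 3} contributes the pairs 2010/7 and 2012/3. If each field were sorted independently per document, the makes would become {3, 7} and the pairing would be lost, which is why sorting would be disallowed for such a field.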