Hi Michael/Mikhails, yet another Mike here: If you create a NumericDocValuesField, and it only ever has one value per doc (0, 1), I think the default Codec will compress it well, though maybe not as well as your idea. It's a neat idea to notice a "very common default value" and not store it and just store the sparse non-default values. I don't think Codec does that today.
For the search-time opto, I thought somewhere in Lucene we do something like your idea, converting to a negation if it has lower estimated cardinality than the positive form. It might only be for points? If the field were stored as postings, you could consult the metadata in the terms dictionary to know the cardinality of each case, and perhaps that the field is single valued and fully set (no missing values), at which point your optimization logic might be able to apply during rewrite maybe? Mike McCandless http://blog.mikemccandless.com On Fri, Nov 10, 2023 at 6:05 PM Michael Froh <msf...@gmail.com> wrote: > Thanks Mikhail and Mike! > > Mikhail, since you replied, I remembered your work on block joins in Solr > (thank you for that, by the way!), which reminded me that it's not unusual > for docs in a Lucene index to "mix" their schemata, like in parent/child > blocks. If 90% of parent docs are "true" on a Boolean field, but the field > doesn't exist for the child docs, my suggested approach would probably see > "true" as the sparse value (assuming there are at least as many children as > parents). Ideally, I would want to only track the "false" parents (and > leave the field off of the children). > > Indeed the idea of a "required" field in Lucene is tricky (though Mike's > suggestion of missing defaults could help). Even worse, I think we'd also > need a way to enforce "exactly one value", since a "Boolean" term field can > really have four states -- true, false, neither, or both. > > It feels like there's not a workable short-term solution to implement > something like this as a regular IndexableField in Lucene (or at least I'm > not seeing it). > > I don't think I see enough value to pitch the idea of adding a new > field-like "thing" (that would have exactly one value for every doc, and > maybe could be counted relative to docs in a block). For now, I think it's > probably only practical to implement something like this as part of a > schema definition in one of the higher-level search servers. > > Thanks for the discussion! > Froh > > On Thu, Nov 9, 2023 at 5:01 AM Michael Sokolov <msoko...@gmail.com> wrote: > >> Can you require the user to specify missing: true or missing: false >> semantics. With that you can decide what to do with the missing values >> >> On Thu, Nov 9, 2023, 7:55 AM Mikhail Khludnev <m...@apache.org> wrote: >> >>> Hello Michael. >>> This optimization "NOT the less common value" assumes that boolean field >>> is required, but how to enforce this mandatory field constraint in Lucene? >>> I'm not aware of something like Solr schema or mapping. >>> If saying foo:true is common, it means that the posting list goes like >>> dense sequentially increasing numbers 1,2,3,4,5.. May it already be >>> compressed by codecs like >>> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/util/packed/MonotonicBlockPackedWriter.html >>> ? >>> >>> On Thu, Nov 9, 2023 at 3:31 AM Michael Froh <msf...@gmail.com> wrote: >>> >>>> Hey, >>>> >>>> I've been musing about ideas for a "clever" Boolean field type on >>>> Lucene for a while, and I think I might have an idea that could work. That >>>> said, this popped into my head this afternoon and has not been fully-baked. >>>> It may not be very clever at all. >>>> >>>> My experience is that Boolean fields tend to be overwhelmingly true or >>>> overwhelmingly false. I've had pretty good luck with using a keyword-style >>>> field, where the only term represents the more sparse value. (For example, >>>> I did a thing years ago with explicit tombstones, where versioned deletes >>>> would have the field "deleted" with a value of "true", and live >>>> documents didn't have the deleted field at all. Every query would add a >>>> filter on "NOT deleted:true".) >>>> >>>> That's great when you know up-front what the sparse value is going to >>>> be. Working on OpenSearch, I just created an issue suggesting that we take >>>> a hint from users for which value they think is going to be more common so >>>> we only index the less common one: >>>> https://github.com/opensearch-project/OpenSearch/issues/11143 >>>> >>>> At the Lucene level, though, we could index a Boolean field type as the >>>> less common term when we flush (by counting the values and figuring out >>>> which is less common). Then, per segment, we can rewrite any query for the >>>> more common value as NOT the less common value. >>>> >>>> You can compute upper/lower bounds on the value frequencies cheaply >>>> during a merge, so I think you could usually write the doc IDs for the less >>>> common value directly (without needing to count them first), even when >>>> input segments disagree on which is the more common value. >>>> >>>> If your Boolean field is not overwhelmingly lopsided, you might even >>>> want to split segments to be 100% true or 100% false, such that queries >>>> against the Boolean field become match-all or match-none. On a retail >>>> website, maybe you have some toggle for "only show me results with property >>>> X" -- if all your property X products are in one segment or a handful of >>>> segments, you can drop the property X clause from the matching segments and >>>> skip the other segments. >>>> >>>> I guess one icky part of this compared to the usual Lucene field model >>>> is that I'm assuming a Boolean field is never missing (or I guess missing >>>> implies "false" by default?). Would that be a deal-breaker? >>>> >>>> Thanks, >>>> Froh >>>> >>> >>> >>> -- >>> Sincerely yours >>> Mikhail Khludnev >>> >>