Hello Michael.
The "NOT the less common value" optimization assumes that the boolean field is
required, but how would that constraint be enforced in Lucene? I'm not aware
of anything there like a Solr schema or a mapping.
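
To make the concern concrete, here is a toy model in plain Java (only
java.util.BitSet, not Lucene's API; the class and method names are made up
for illustration). Each segment stores just the rarer value, and a query for
the common value is answered as NOT the stored one. A document that never had
the field then matches the common value as well:

```java
import java.util.BitSet;

public class SparseBooleanSketch {
    /** Toy stand-in for one segment: only the rarer value's doc IDs are kept. */
    static final class Segment {
        final int maxDoc;
        final boolean storedValue;        // the less common value in this segment
        final BitSet docsWithStoredValue; // doc IDs that carry storedValue

        Segment(int maxDoc, boolean storedValue, BitSet docsWithStoredValue) {
            this.maxDoc = maxDoc;
            this.storedValue = storedValue;
            this.docsWithStoredValue = docsWithStoredValue;
        }

        /** Docs matching field:value; the common value is "NOT the stored one". */
        BitSet match(boolean value) {
            BitSet result = (BitSet) docsWithStoredValue.clone();
            if (value != storedValue) {
                result.flip(0, maxDoc); // complement within the segment
            }
            return result;
        }
    }

    public static void main(String[] args) {
        // Six docs: doc 2 is false, doc 4 never had the field, the rest are true.
        BitSet falseDocs = new BitSet();
        falseDocs.set(2);
        Segment segment = new Segment(6, false, falseDocs);

        // Doc 4 shows up as "true" even though it had no value at all.
        System.out.println(segment.match(true)); // prints {0, 1, 3, 4, 5}
    }
}
```

So unless something guarantees that every document has the field, "missing"
silently becomes the more common value in that segment.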
If foo:true is the common value, its posting list is a dense run of
sequentially increasing doc IDs: 1, 2, 3, 4, 5, ... Might that already be
compressed well by codecs like
https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/util/packed/MonotonicBlockPackedWriter.html
?
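
A rough sketch of why I'd expect that, in plain Java. This is a toy, not
Lucene's encoder: it assumes a scheme that models a block of values as a
linear function and stores only the deviations from it (which is how I read
the javadoc linked above). The class and method names are hypothetical.

```java
public class DenseDocIdsSketch {
    /** Bits per value needed for the deviations from values[0] + slope * i. */
    static int bitsPerDeviation(long[] values) {
        int n = values.length;
        if (n < 2) {
            return 0;
        }
        // Integer slope estimated from the first and last value of the block.
        long slope = (values[n - 1] - values[0]) / (n - 1);
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (int i = 0; i < n; i++) {
            long deviation = values[i] - (values[0] + slope * i);
            min = Math.min(min, deviation);
            max = Math.max(max, deviation);
        }
        // Number of bits needed to represent the range of deviations.
        return 64 - Long.numberOfLeadingZeros(max - min);
    }

    public static void main(String[] args) {
        long[] dense = {1, 2, 3, 4, 5, 6, 7, 8};
        long[] sparse = {3, 10, 11, 40, 41, 90, 200, 512};
        System.out.println(bitsPerDeviation(dense));  // prints 0
        System.out.println(bitsPerDeviation(sparse)); // prints 9
    }
}
```

With a perfectly dense run every deviation is zero, so the block costs
almost nothing beyond its header. Whether the postings format actually gets
that benefit is the part I'm asking about.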

On Thu, Nov 9, 2023 at 3:31 AM Michael Froh <msf...@gmail.com> wrote:

> Hey,
>
> I've been musing about ideas for a "clever" Boolean field type on Lucene
> for a while, and I think I might have an idea that could work. That said,
> this popped into my head this afternoon and has not been fully baked. It
> may not be very clever at all.
>
> My experience is that Boolean fields tend to be overwhelmingly true or
> overwhelmingly false. I've had pretty good luck with using a keyword-style
> field, where the only term represents the more sparse value. (For example,
> I did a thing years ago with explicit tombstones, where versioned deletes
> would have the field "deleted" with a value of "true", and live
> documents didn't have the deleted field at all. Every query would add a
> filter on "NOT deleted:true".)
>
> That's great when you know up-front what the sparse value is going to be.
> Working on OpenSearch, I just created an issue suggesting that we take a
> hint from users for which value they think is going to be more common so we
> only index the less common one:
> https://github.com/opensearch-project/OpenSearch/issues/11143
>
> At the Lucene level, though, we could index a Boolean field type as the
> less common term when we flush (by counting the values and figuring out
> which is less common). Then, per segment, we can rewrite any query for the
> more common value as NOT the less common value.
>
> You can compute upper/lower bounds on the value frequencies cheaply during
> a merge, so I think you could usually write the doc IDs for the less common
> value directly (without needing to count them first), even when input
> segments disagree on which is the more common value.
>
> If your Boolean field is not overwhelmingly lopsided, you might even want
> to split segments to be 100% true or 100% false, such that queries against
> the Boolean field become match-all or match-none. On a retail website,
> maybe you have some toggle for "only show me results with property X" -- if
> all your property X products are in one segment or a handful of segments,
> you can drop the property X clause from the matching segments and skip the
> other segments.
>
> I guess one icky part of this compared to the usual Lucene field model is
> that I'm assuming a Boolean field is never missing (or I guess missing
> implies "false" by default?). Would that be a deal-breaker?
>
> Thanks,
> Froh
>


-- 
Sincerely yours
Mikhail Khludnev
