Hey,

I've been musing about ideas for a "clever" Boolean field type on Lucene
for a while, and I think I might have an idea that could work. That said,
this popped into my head this afternoon and is not fully baked. It
may not be very clever at all.

My experience is that Boolean fields tend to be overwhelmingly true or
overwhelmingly false. I've had pretty good luck using a keyword-style
field whose only term represents the sparser value. (For example,
I did a thing years ago with explicit tombstones, where versioned deletes
would have the field "deleted" with a value of "true", and live
documents didn't have the deleted field at all. Every query would add a
filter on "NOT deleted:true".)

That's great when you know up-front what the sparse value is going to be.
Working on OpenSearch, I just created an issue suggesting that we take a
hint from users about which value they expect to be more common, so that we
only index the less common one:
https://github.com/opensearch-project/OpenSearch/issues/11143

At the Lucene level, though, we could index a Boolean field type as the
less common term when we flush (by counting the values and figuring out
which is less common). Then, per segment, we can rewrite any query for the
more common value as NOT the less common value.
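
To make that concrete, here is a rough sketch in plain Python (not Lucene code). The names BoolSegment, flush, and search are made up for illustration, and a real implementation would use postings and bitsets rather than Python lists:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BoolSegment:
    """Hypothetical per-segment structure: only the rarer value has postings."""
    max_doc: int
    indexed_value: bool   # the value that was actually indexed (the rarer one)
    postings: List[int]   # sorted doc IDs that hold indexed_value


def flush(values: List[bool]) -> BoolSegment:
    """Count the buffered values and index only the less common one."""
    true_count = sum(values)
    # Index "true" when it is the rarer value; ties arbitrarily go to "true".
    indexed_value = true_count * 2 <= len(values)
    postings = [doc for doc, v in enumerate(values) if v == indexed_value]
    return BoolSegment(len(values), indexed_value, postings)


def search(segment: BoolSegment, wanted: bool) -> List[int]:
    """Return matching doc IDs, rewriting the common value as NOT the rare one."""
    if wanted == segment.indexed_value:
        return list(segment.postings)
    excluded = set(segment.postings)
    return [doc for doc in range(segment.max_doc) if doc not in excluded]
```

Note that this sketch bakes in the assumption that every document has a value for the field, since the complement of the indexed postings is treated as the other value.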

You can compute upper/lower bounds on the value frequencies cheaply during
a merge, so I think you could usually write the doc IDs for the less common
value directly (without needing to count them first), even when input
segments disagree on which is the more common value.
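
Here is one way the bounds could work, again as a plain Python sketch rather than Lucene code. I'm assuming each input segment records its total doc count, how many docs were "true" at write time, and how many docs have since been deleted (without knowing which value the deleted docs held). SegmentStats and the function names are hypothetical:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SegmentStats:
    """Hypothetical per-segment metadata available without scanning postings."""
    max_doc: int      # docs in the segment, including deleted ones
    true_count: int   # docs with value true, including deleted ones
    deleted: int      # deleted docs; their values are unknown without a scan


def true_bounds(seg: SegmentStats) -> Tuple[int, int]:
    """Lower and upper bounds on the number of live docs with value true."""
    live = seg.max_doc - seg.deleted
    lower = max(0, seg.true_count - seg.deleted)  # every delete hit a true doc
    upper = min(seg.true_count, live)             # deletes hit false docs first
    return lower, upper


def rarer_value_after_merge(segments: List[SegmentStats]) -> Optional[bool]:
    """Return the value known to be rarer in the merged segment, or None."""
    live = sum(s.max_doc - s.deleted for s in segments)
    lower = sum(true_bounds(s)[0] for s in segments)
    upper = sum(true_bounds(s)[1] for s in segments)
    if 2 * upper <= live:
        return True    # true is at most half of the live docs
    if 2 * lower >= live:
        return False   # true is at least half of the live docs
    return None        # bounds straddle the midpoint: count before writing
```

When the result is None, the merge would have to fall back to counting live values before deciding which doc IDs to write.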

If your Boolean field is not overwhelmingly lopsided, you might even want
to split segments to be 100% true or 100% false, such that queries against
the Boolean field become match-all or match-none. On a retail website,
maybe you have some toggle for "only show me results with property X" -- if
all your property X products are in one segment or a handful of segments,
you can drop the property X clause from the matching segments and skip the
other segments.
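
The per-segment decision for that case could look something like this (a plain Python sketch with made-up names, assuming each segment knows its doc count and its count of true values, and ignoring deletes):

```python
from dataclasses import dataclass


@dataclass
class SegmentCounts:
    """Hypothetical per-segment counts for one Boolean field."""
    doc_count: int
    true_count: int


def rewrite_for_segment(seg: SegmentCounts, wanted: bool) -> str:
    """Decide how a clause on the Boolean field behaves within one segment."""
    matching = seg.true_count if wanted else seg.doc_count - seg.true_count
    if matching == 0:
        return "match_none"   # skip the segment entirely
    if matching == seg.doc_count:
        return "match_all"    # drop the clause for this segment
    return "filter"           # mixed segment: evaluate the clause normally
```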

I guess one icky part of this compared to the usual Lucene field model is
that I'm assuming a Boolean field is never missing (or I guess missing
implies "false" by default?). Would that be a deal-breaker?

Thanks,
Froh
