[
https://issues.apache.org/jira/browse/LUCENE-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16052985#comment-16052985
]
Erick Erickson commented on LUCENE-7880:
----------------------------------------
Adrien:
We simply cannot create arbitrary limits and say "you should learn to do a
better search and not need to ever exceed X" because we cannot anticipate the
"meta issues" Solr or Lucene users have to deal with.
Here are some use-cases I've seen in the wild:
I have seen very complex queries in legal situations that run to 20K + raw
input before rewrite. Ditto pharma applications. If you're investing 750M in
drug development you want to be very, very, very sure that there is no prior
art you'd violate. These organizations pay people who create amazingly complex
queries and mix-n-match them to ensure that they're not infringing on a patent.
I've seen these organizations create 4K reusable sub-queries that they then
combine in various ways to accomplish their goals. A dozen at a time. Or more.
It gets worse in the legal arena. I can and have said, from an IR standpoint,
"Why are you searching for (runn OR runni OR runnin OR running)? Why not just
stem?" Ans: "Because I can defend this search. I cannot explain in court that
'we did (or didn't) find document X because the search engine we're using uses
a stemming algorithm that did (or didn't) match'." And there may be hundreds
and hundreds of such clauses.
Genome matching. Believe it or not I've worked with people using Solr/Lucene
for matching genome sequences. The complexity here is beyond belief. The
queries are machine-generated, to be sure.
These kinds of domains often do not care how long a search takes. Well, that's
not quite true, but for some applications the reaction is "10 minutes? Great!".
I once worked with an organization that had, prior to Solr, a 48-hour
turnaround and was happy with anything less than an hour and could live with
longer ones on occasion. Not very many users, true. So my reflexive reaction of
"well, these will be expensive queries, it may take some time to return" is
invalid.
So you see why I'll -1, every time, imposing limits that aren't absolutely
necessary.
> Make boolean query clause limit configurable per-query
> ------------------------------------------------------
>
> Key: LUCENE-7880
> URL: https://issues.apache.org/jira/browse/LUCENE-7880
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Yonik Seeley
>
> As we know, the magic BooleanQuery.maxClauseCount has bitten many people over
> time.
> It's also a static, which really hurts multi-tenancy (i.e. we can't have
> different settings for different users, clients, or use-cases).
> If we want to keep this static as a default, then at least we should allow it
> to be overridden on a per-query basis when we know it is the desired behavior
> and not a bug.
> Perhaps the simplest way to achieve this would be a setter on
> BooleanQuery.Builder that configures the limit for that instance only?
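
The per-instance override proposed in the description could look roughly like
the sketch below. This is a hypothetical, Lucene-free model, not Lucene's API:
the class ClauseLimitSketch, its nested Builder, the setMaxClauseCount setter
on the builder, and TooManyClausesException are all names invented for
illustration. The only detail taken from Lucene is the default of 1024 for the
static limit.

```java
// Hypothetical sketch: none of these types are Lucene classes. They only model
// "static default limit, optionally overridden per builder instance".
final class ClauseLimitSketch {

    /** Thrown when adding a clause would exceed the builder's effective limit. */
    static final class TooManyClausesException extends RuntimeException {
        TooManyClausesException(int limit) {
            super("clause limit of " + limit + " exceeded");
        }
    }

    /** Process-wide default, standing in for the static limit in the issue. */
    static volatile int defaultMaxClauseCount = 1024;

    static final class Builder {
        // -1 means "no override": fall back to the static default.
        private int maxClauseCount = -1;
        private final java.util.List<String> clauses = new java.util.ArrayList<>();

        /** Overrides the limit for this builder only (assumed setter name). */
        Builder setMaxClauseCount(int limit) {
            if (limit < 1) {
                throw new IllegalArgumentException("limit must be >= 1");
            }
            this.maxClauseCount = limit;
            return this;
        }

        /** The per-instance limit if set, otherwise the static default. */
        int effectiveLimit() {
            return maxClauseCount > 0 ? maxClauseCount : defaultMaxClauseCount;
        }

        /** Adds a clause, failing once the effective limit is reached. */
        Builder add(String clause) {
            if (clauses.size() >= effectiveLimit()) {
                throw new TooManyClausesException(effectiveLimit());
            }
            clauses.add(clause);
            return this;
        }

        int size() {
            return clauses.size();
        }
    }
}
```

With this shape, a caller that knows a 20K-clause query is intentional would
call setMaxClauseCount(20000) on that one builder, while every other builder
in the same JVM keeps the static default.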