Using spans and wildcards together is asking for trouble: you will hit
limits, and it is inherently inefficient.

I'd recommend changing your indexing so that your queries are fast
and you aren't using wildcards that enumerate many terms at
search time.
Don't index terms such as "bar_294e50e1-fc3c-450f-a04f-7b4ad79587d6"
and then use wildcards to match just "bar".
Instead, add a synonym "bar" (or whatever you prefer) for
"bar_294e50e1-fc3c-450f-a04f-7b4ad79587d6".
That way you can match it with an ordinary term query: "bar".

e.g. for your simple example, this would look approximately like this:
instead of: "abc foo bar_" + UUID.randomUUID()
index something like: "abc foo bar bar_" + UUID.randomUUID()

But if you inject the synonym with an analyzer, then
bar_294e50e1-fc3c-450f-a04f-7b4ad79587d6 and its synonym "bar" will
sit at the same position, so your spans/sloppy phrases will work fine.
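
Not tested, and the names here (PrefixAliasFilter, the hard-coded
"bar"/"bar_" strings, the whitespace tokenizer) are just placeholders
for whatever your real analysis chain looks like, but the idea would be
roughly this:

```java
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/** Sketch: emit an extra "bar" token at the same position as any token starting with "bar_". */
final class PrefixAliasFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
    private State pendingAlias;

    PrefixAliasFilter(TokenStream in) {
        super(in);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (pendingAlias != null) {
            // Replay the saved token, overwrite the term with the alias,
            // and keep it at the same position (position increment 0).
            restoreState(pendingAlias);
            pendingAlias = null;
            termAtt.setEmpty().append("bar");
            posIncAtt.setPositionIncrement(0);
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        if (termAtt.toString().startsWith("bar_")) {
            pendingAlias = captureState();
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pendingAlias = null;
    }

    /** Analyzer that wires the filter in; use it at indexing time. */
    static Analyzer aliasAnalyzer() {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer source = new WhitespaceTokenizer();
                return new TokenStreamComponents(source, new PrefixAliasFilter(source));
            }
        };
    }
}
```

Then pass that analyzer to IndexWriterConfig and replace the
SpanMultiTermQueryWrapper/PrefixQuery clause in your test with a plain
SpanTermQuery on "bar": no term enumeration at search time, so no
maxClauseCount problem.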

On Thu, Mar 28, 2024 at 11:37 AM Yixun Xu <yix...@gmail.com> wrote:
>
> Hello,
>
> We are trying to search for phrases where the last term is a prefix match.
> For example, find all documents that contain "foo bar.*", with a
> configurable slop between "foo" and "bar". We were able to do this using
> `SpanNearQuery` where the last clause is a `SpanMultiTermQueryWrapper` that
> wraps a `PrefixQuery`. However, this seems to run into the limit of 1024
> clauses very quickly if the last term appears as a common prefix in the
> index.
>
> I have a branch that reproduces the query at
> https://github.com/apache/lucene/compare/main...yixunx:yx/span-query-limit?expand=1,
> and also pasted the code below.
>
> It seems that if slop = 0 then we can use `MultiPhraseQuery` instead, which
> doesn't hit the clause limit. For the slop != 0 case, is it intended that
> `SpanMultiTermQueryWrapper` can easily hit the clause limit, or am I using
> the queries wrong? Is there a workaround other than increasing
> `maxClauseCount`?
>
> Thank you for the help!
>
> ```java
> public class TestSpanNearQueryClauseLimit extends LuceneTestCase {
>
>     private static final String FIELD_NAME = "field";
>     private static final int NUM_DOCUMENTS = 1025;
>
>     /**
>      * Creates an index with NUM_DOCUMENTS documents. Each document has a
>      * text field in the form of "abc foo bar_[UUID]".
>      */
>     private Directory createIndex() throws Exception {
>         Directory dir = newDirectory();
>         try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
>             for (int i = 0; i < NUM_DOCUMENTS; i++) {
>                 Document doc = new Document();
>                 doc.add(new TextField(FIELD_NAME, "abc foo bar_" + UUID.randomUUID(), Field.Store.YES));
>                 writer.addDocument(doc);
>             }
>             writer.commit();
>         }
>         return dir;
>     }
>
>     public void testSpanNearQueryClauseLimit() throws Exception {
>         Directory dir = createIndex();
>
>         // Find documents that match "abc <some term> bar.*", which should match all documents.
>         try (IndexReader reader = DirectoryReader.open(dir)) {
>             Query query = new SpanNearQuery.Builder(FIELD_NAME, true)
>                     .setSlop(1)
>                     .addClause(new SpanTermQuery(new Term(FIELD_NAME, "abc")))
>                     .addClause(new SpanMultiTermQueryWrapper<>(new PrefixQuery(new Term(FIELD_NAME, "bar"))))
>                     .build();
>
>             // This throws an exception if NUM_DOCUMENTS is > 1024:
>             // org.apache.lucene.search.IndexSearcher$TooManyNestedClauses:
>             //   Query contains too many nested clauses; maxClauseCount is set to 1024
>             TopDocs docs = new IndexSearcher(reader).search(query, 10);
>             System.out.println(docs.totalHits);
>         }
>
>         dir.close();
>     }
> }
> ```
>
> Thank you,
> Yixun Xu
