SpanMultiTermQueryWrapper with PrefixQuery hitting num clause limit

Yixun Xu Thu, 28 Mar 2024 08:37:37 -0700

Hello,

We are trying to search for phrases where the last term is a prefix match.
For example, find all documents that contain "foo bar.*", with a
configurable slop between "foo" and "bar". We were able to do this using
`SpanNearQuery` where the last clause is a `SpanMultiTermQueryWrapper` that
wraps a `PrefixQuery`. However, this quickly runs into the limit of 1024
clauses when many terms in the index share that prefix.
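
As far as I understand it, the wrapper expands the prefix into one clause per
matching term, so the clause count grows with the number of distinct terms
that share the prefix. Below is a minimal plain-Java sketch of that counting.
It is only a simplified model, not Lucene's actual rewrite code: the class
`PrefixExpansionSketch`, its methods, and the synthetic `bar_<i>` terms are
hypothetical names used for illustration.

```java
import java.util.SortedSet;
import java.util.TreeSet;

/** Simplified model of prefix expansion; NOT Lucene's actual implementation. */
public class PrefixExpansionSketch {

    // Mirrors the limit of 1024 clauses mentioned above.
    public static final int MAX_CLAUSE_COUNT = 1024;

    /** Counts the terms in a sorted term dictionary that start with the prefix. */
    public static int countExpandedClauses(SortedSet<String> terms, String prefix) {
        // Every string starting with the prefix sorts inside
        // [prefix, prefix + '\uffff'), assuming no term contains '\uffff'.
        return terms.subSet(prefix, prefix + Character.MAX_VALUE).size();
    }

    /** Returns true if expanding the prefix would need more clauses than the limit. */
    public static boolean exceedsLimit(SortedSet<String> terms, String prefix) {
        return countExpandedClauses(terms, prefix) > MAX_CLAUSE_COUNT;
    }

    /** Builds a toy term dictionary: "abc", "foo", and one unique "bar_<i>" per document. */
    public static SortedSet<String> buildTerms(int numDocuments) {
        SortedSet<String> terms = new TreeSet<>();
        terms.add("abc");
        terms.add("foo");
        for (int i = 0; i < numDocuments; i++) {
            terms.add("bar_" + i);
        }
        return terms;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = buildTerms(1025);
        System.out.println(countExpandedClauses(terms, "bar")); // prints 1025
        System.out.println(exceedsLimit(terms, "bar"));         // prints true
    }
}
```

In this model, 1024 documents with unique `bar_` terms stay within the limit
and 1025 exceed it, which matches the threshold I see in the test further down.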


I have a branch that reproduces the issue at
https://github.com/apache/lucene/compare/main...yixunx:yx/span-query-limit?expand=1,
and I have also pasted the code below.

It seems that if slop = 0 then we can use `MultiPhraseQuery` instead, which
doesn't hit the clause limit. For the slop != 0 case, is it intended that
`SpanMultiTermQueryWrapper` can easily hit the clause limit, or am I using
the queries wrong? Is there a workaround other than increasing
`maxClauseCount`?

Thank you for the help!

```java
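import java.util.UUID;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.spans.SpanMultiTermQueryWrapper;
import org.apache.lucene.queries.spans.SpanNearQuery;
import org.apache.lucene.queries.spans.SpanTermQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.tests.util.LuceneTestCase;

public class TestSpanNearQueryClauseLimit extends LuceneTestCase {

    private static final String FIELD_NAME = "field";
    private static final int NUM_DOCUMENTS = 1025;

    /**
     * Creates an index with NUM_DOCUMENTS documents. Each document has a
     * text field in the form of "abc foo bar_[UUID]".
     */
    private Directory createIndex() throws Exception {
        Directory dir = newDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
            for (int i = 0; i < NUM_DOCUMENTS; i++) {
                Document doc = new Document();
                doc.add(new TextField(FIELD_NAME, "abc foo bar_" + UUID.randomUUID(), Field.Store.YES));
                writer.addDocument(doc);
            }
            writer.commit();
        }
        return dir;
    }

    public void testSpanNearQueryClauseLimit() throws Exception {
        Directory dir = createIndex();

        // Find documents that match "abc <some term> bar.*", which should
        // match all documents.
        try (IndexReader reader = DirectoryReader.open(dir)) {
            Query query = new SpanNearQuery.Builder(FIELD_NAME, true)
                    .setSlop(1)
                    .addClause(new SpanTermQuery(new Term(FIELD_NAME, "abc")))
                    .addClause(new SpanMultiTermQueryWrapper<>(new PrefixQuery(new Term(FIELD_NAME, "bar"))))
                    .build();

            // This throws the following exception if NUM_DOCUMENTS is > 1024:
            //   org.apache.lucene.search.IndexSearcher$TooManyNestedClauses:
            //   Query contains too many nested clauses;
            //   maxClauseCount is set to 1024
            TopDocs docs = new IndexSearcher(reader).search(query, 10);
            System.out.println(docs.totalHits);
        }

        dir.close();
    }
}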
```

Thank you,
Yixun Xu
