drempapis commented on PR #16031:
URL: https://github.com/apache/lucene/pull/16031#issuecomment-4381374667
> I don't think these are good realistic test strings. It will inflate the
numbers due to huge alphabet size. How many words have 20+ unique letters in
them?
Fair point, I re-ran with alphabet in {4, 8, 12, 26} so the realistic case
(~ 8 unique letters) and the worst case (26) are both visible. Per-build
retained bytes (sum of CompiledAutomaton.ramBytesUsed() across the maxEdits+1
automata):
```
termLength edits alpha=4 alpha=8 alpha=12 alpha=26
16 1 ~36 KB ~41 KB ~46 KB ~50 KB
32 1 ~69 KB ~78 KB ~87 KB ~118 KB
64 1 ~134 KB ~151 KB ~169 KB ~231 KB
128 1 ~264 KB ~299 KB ~333 KB ~456 KB
256 1 ~523 KB ~593 KB ~662 KB ~906 KB
16 2 ~230 KB ~259 KB ~288 KB ~316 KB
32 2 ~466 KB ~525 KB ~583 KB ~789 KB
64 2 ~938 KB ~1.06 MB ~1.18 MB ~1.59 MB
128 2 ~1.88 MB ~2.12 MB ~2.36 MB ~3.19 MB
256 2 ~3.77 MB ~4.25 MB ~4.73 MB ~6.39 MB
```
```
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
@Fork(value = 1, warmups = 1)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 5, time = 1)
public class FuzzyAutomatonRamBenchmark {
@Param({"16", "32", "64", "128", "256"})
public int termLength;
@Param({"1", "2"})
public int maxEdits;
@Param({"4", "8", "12", "26"})
public int alphabet;
private FuzzyQuery query;
@AuxCounters(AuxCounters.Type.EVENTS)
@State(Scope.Thread)
public static class AutomataRam {
public long ramBytes;
}
@Setup(Level.Trial)
public void setup() {
if (alphabet < 1 || alphabet > 26) {
throw new IllegalArgumentException("alphabet must be in [1, 26], got:
" + alphabet);
}
StringBuilder sb = new StringBuilder(termLength);
for (int i = 0; i < termLength; i++) {
sb.append((char) ('a' + (i % alphabet)));
}
query =
new FuzzyQuery(
new Term("f", sb.toString()),
maxEdits,
0,
FuzzyQuery.defaultMaxExpansions,
true);
}
@Benchmark
public long buildAutomata(AutomataRam metrics) {
long bytes = query.computeAutomataRamBytes(new AttributeSource());
metrics.ramBytes = bytes;
return bytes;
}
}
```
- The alphabet does inflate the constant, but not by much. From realistic
(alphabet=8) to saturated (alphabet=26) the bytes grow by ~30–50%. Going from 4
to 8 grows them by ~10–15%. So my earlier numbers were biased, but not by a
multiple, the shape and order of magnitude are the same.
- The two scaling properties hold at every alphabet: bytes are linear in
termLength (each doubling of length doubles the bytes, within a few percent),
and going from maxEdits=1 to maxEdits=2 is a bounded ~7× regardless of alphabet.
So the cost is driven by `termLength` and `maxEdits`. The realistic case is
still meaningful in absolute terms. A 64-character term with 8 unique letters
at maxEdits=2 retains ~1 MB per build, and a 128-character term retains ~2 MB.
These are realistic shapes for many fields people index as keyword and run
fuzzy queries against:
- Identifiers: UUIDs / ISBNS and any kind of IDs
- URLs and file paths
- Hostnames and DNS labels
- Tokenized phrases
- etc
"Long fuzzy terms" is not a corner case. On a busy node with many concurrent
fuzzy queries, and more stuff running in parallel, the per-request automata
cost is real, and today it is invisible to per-request accounting. This PR is
tied to a concrete use case where a single query includes many tokens, and some
of those tokens are large.
I agree that ideally the user should fix their tokenization and search
strategy. But on the server side we cannot count on that. The implementation
has to be defensive and protect the node regardless of how clients shape their
queries, that is exactly what request-scoped accounting is for.
If you see a better way to expose this, I'm open to it, happy to change
direction.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]