drempapis commented on PR #16031:
URL: https://github.com/apache/lucene/pull/16031#issuecomment-4381374667

   > I don't think these are good realistic test strings. It will inflate the 
numbers due to huge alphabet size. How many words have 20+ unique letters in 
them?
   
   Fair point,  I re-ran with alphabet in {4, 8, 12, 26} so the realistic case 
(~ 8 unique letters) and the worst case (26) are both visible. Per-build 
retained bytes (sum of CompiledAutomaton.ramBytesUsed() across the maxEdits+1 
automata):
   
   ```
   termLength  edits  alpha=4   alpha=8   alpha=12  alpha=26
       16        1     ~36 KB    ~41 KB    ~46 KB    ~50 KB
       32        1     ~69 KB    ~78 KB    ~87 KB   ~118 KB
       64        1    ~134 KB   ~151 KB   ~169 KB   ~231 KB
      128        1    ~264 KB   ~299 KB   ~333 KB   ~456 KB
      256        1    ~523 KB   ~593 KB   ~662 KB   ~906 KB
       16        2    ~230 KB   ~259 KB   ~288 KB   ~316 KB
       32        2    ~466 KB   ~525 KB   ~583 KB   ~789 KB
       64        2    ~938 KB  ~1.06 MB  ~1.18 MB  ~1.59 MB
      128        2   ~1.88 MB  ~2.12 MB  ~2.36 MB  ~3.19 MB
      256        2   ~3.77 MB  ~4.25 MB  ~4.73 MB  ~6.39 MB
   ```
   
   ```
   @BenchmarkMode(Mode.AverageTime)
   @OutputTimeUnit(TimeUnit.MICROSECONDS)
   @State(Scope.Benchmark)
   @Fork(value = 1, warmups = 1)
   @Warmup(iterations = 3, time = 1)
   @Measurement(iterations = 5, time = 1)
   public class FuzzyAutomatonRamBenchmark {
   
     @Param({"16", "32", "64", "128", "256"})
     public int termLength;
   
     @Param({"1", "2"})
     public int maxEdits;
   
     @Param({"4", "8", "12", "26"})
     public int alphabet;
   
     private FuzzyQuery query;
   
     @AuxCounters(AuxCounters.Type.EVENTS)
     @State(Scope.Thread)
     public static class AutomataRam {
       public long ramBytes;
     }
   
     @Setup(Level.Trial)
     public void setup() {
       if (alphabet < 1 || alphabet > 26) {
         throw new IllegalArgumentException("alphabet must be in [1, 26], got: 
" + alphabet);
       }
       StringBuilder sb = new StringBuilder(termLength);
       for (int i = 0; i < termLength; i++) {
         sb.append((char) ('a' + (i % alphabet)));
       }
       query =
           new FuzzyQuery(
               new Term("f", sb.toString()),
               maxEdits,
               0,
               FuzzyQuery.defaultMaxExpansions,
               true);
     }
   
     @Benchmark
     public long buildAutomata(AutomataRam metrics) {
       long bytes = query.computeAutomataRamBytes(new AttributeSource());
       metrics.ramBytes = bytes;
       return bytes;
     }
   }
   ```
   
   - The alphabet does inflate the constant, but not by much. From realistic 
(alphabet=8) to saturated (alphabet=26) the bytes grow by ~30–50%. Going from 4 
to 8 grows them by ~10–15%. So my earlier numbers were biased, but not by a 
multiple, the shape and order of magnitude are the same.
   
   - The two scaling properties hold at every alphabet: bytes are linear in 
termLength (each doubling of length doubles the bytes, within a few percent), 
and going from maxEdits=1 to maxEdits=2 is a bounded ~7× regardless of alphabet.
   
   So the cost is driven by `termLength` and `maxEdits`. The realistic case is 
still meaningful in absolute terms. A 64-character term with 8 unique letters 
at maxEdits=2 retains ~1 MB per build, and a 128-character term retains ~2 MB. 
These are realistic shapes for many fields people index as keyword and run 
fuzzy queries against:
   - Identifiers: UUIDs / ISBNS and any kind of IDs
   - URLs and file paths
   - Hostnames and DNS labels
   - Tokenized phrases
   - etc
   
   "Long fuzzy terms" is not a corner case. On a busy node with many concurrent 
fuzzy queries, and more stuff running in parallel, the per-request automata 
cost is real, and today it is invisible to per-request accounting. This PR is 
tied to a concrete use case where a single query includes many tokens, and some 
of those tokens are large.
   
   I agree that ideally the user should fix their tokenization and search 
strategy. But on the server side we cannot count on that. The implementation 
has to be defensive and protect the node regardless of how clients shape their 
queries, that is exactly what request-scoped accounting is for.
   
   If you see a better way to expose this, I'm open to it, happy to change 
direction.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to