puranjay2597 opened a new pull request, #60799:
URL: https://github.com/apache/doris/pull/60799

   
   Adds 5 new built-in scalar functions for fuzzy string matching and 
similarity scoring, useful for record deduplication, search ranking, and data 
quality workflows:
   
   | Function | Return type | Description |
   |---|---|---|
   | `levenshtein(str1, str2)` | INT |  |
   | `damerau_levenshtein(str1, str2)` | INT | True Damerau-Levenshtein 
distance (unrestricted transpositions) |
   | `jaro_winkler(str1, str2)` | DOUBLE | Jaro-Winkler similarity [0.0, 1.0] |
   | `cosine_similarity(str1, str2)` | DOUBLE | Cosine similarity on 
byte-frequency vectors [0.0, 1.0] |
   | `jaccard_similarity(str1, str2)` | DOUBLE | Jaccard similarity on byte 
bigrams [0.0, 1.0] |
   
   All functions accept `VARCHAR` and `STRING` inputs and propagate NULL.
   
   ### Release note
   
   Add 5 built-in string similarity/distance functions: `levenshtein`, 
`damerau_levenshtein`, `jaro_winkler`, `cosine_similarity`, 
`jaccard_similarity`.
   
   ### Check List (For Author)
   
   - Test
       - [x] Manual test (add detailed scripts or steps below)
   
   ```sql
   -- Levenshtein
   SELECT levenshtein('kitten', 'sitting');          -- 3
   SELECT levenshtein('', 'abc');                    -- 3
   
   -- Damerau-Levenshtein (true, not OSA)
   SELECT damerau_levenshtein('ca', 'abc');          -- 2  (transposition 
ca->ac + insert b)
   SELECT damerau_levenshtein('kitten', 'sitting');  -- 3
   
   -- Jaro-Winkler
   SELECT jaro_winkler('MARTHA', 'MARHTA');          -- ~0.9611
   SELECT jaro_winkler('abc', 'abc');                -- 1.0
   
   -- Cosine similarity
   SELECT cosine_similarity('hello', 'hello');       -- 1.0
   SELECT cosine_similarity('hello', 'world');       -- ~0.516
   
   -- Jaccard similarity
   SELECT jaccard_similarity('night', 'nacht');      -- ~0.222
   SELECT jaccard_similarity('abc', 'abc');          -- 1.0
   
   ```
   
   ### Performance
   
   Benchmarked against equivalent Java UDFs on Apache Doris 4.0.2
   
   **Test setup:** 100M rows × 96-char strings (md5-generated), `SELECT 
count(func(a, b))` to isolate compute from I/O.
   
   | Rows | C++ builtin | Java UDF | Speedup |
   |---|---|---|---|
   | 1M | 1.5s | 6.5s | 4.3x* |
   | 100M | 2m 20s | 4m 25s | ~1.9x |
   | 1B | 19m | 36m | ~1.89x |
   | 400M production rows (repeated query) | 0.11s | 96s | ~870x |
   
   \* 1M speedup is JVM-warmup-inflated; 100M and 1B represent fully 
JIT-compiled steady state.
   
   Key observations:
   - **Consistent ~2x speedup** at scale (100M and 1B confirm the JVM reaches 
full JIT compilation and the advantage stabilises).
   - **Result cache**: C++ builtins benefit from Doris SQL-layer result caching 
keyed on query text + data version. Java UDFs bypass the result cache entirely 
and always recompute — explaining the ~870x gap on repeated queries.
   - **Stack allocation**: `levenshtein`, `jaro_winkler`, and 
`jaccard_similarity` use stack-allocated arrays for strings ≤512 bytes, 
eliminating heap allocation for the common case.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to