MonkeyCanCode commented on code in PR #4075:
URL: https://github.com/apache/polaris/pull/4075#discussion_r3024602485


##########
client/python/apache_polaris/cli/command/utils.py:
##########
@@ -64,3 +69,140 @@ def format_timestamp(ms_since_epoch: int) -> str:
         ms_since_epoch / 1000, tz=datetime.timezone.utc
     )
     return dt.strftime("%Y-%m-%d %H:%M:%S UTC")
+
+
+def is_fuzzy_match(query: str, target: str, threshold: float = 0.85) -> bool:
+    """
+    Determine if a query matches a target using multi-stage fuzzy strategies and case-insensitive.
+    """
+    if not query:
+        return False
+    q = query.lower()
+    t = target.lower()
+    query_len = len(q)
+    # Exact match
+    if q == t:
+        return True
+    # Prefix match
+    if t.startswith(q):
+        return True
+    # Substring match: enabled for length > 1
+    if query_len > 1 and q in t:
+        return True
+    # Subsequence match: enabled for length > 2
+    if query_len > 2:
+        iterator = iter(t)
+        if all(char in iterator for char in q):

Review Comment:
   This was added to avoid false-positive noise. For example, if we allow `SequenceMatcher` 
on queries of any length, a single letter `a` will match anything that contains the 
letter `a`. 
   Thus, what I had in mind was the following:
   * len 1: only exact or prefix match 
   * len 2: add substring match (q in t)
   * len 3: add subsequence match
   * len 4+: similarity ratio check via SequenceMatcher
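
   The tiers above can be sketched roughly as follows. This is only an illustration: `tiered_match` is a hypothetical name, and the form of the final `SequenceMatcher` stage (comparing `ratio()` against `threshold`) is an assumption, since that part of the PR is not shown in this hunk.

```python
from difflib import SequenceMatcher


def tiered_match(query: str, target: str, threshold: float = 0.85) -> bool:
    """Hypothetical sketch of the length-gated tiers listed above."""
    if not query:
        return False
    q, t = query.lower(), target.lower()
    # len 1+: exact or prefix match only
    if q == t or t.startswith(q):
        return True
    # len 2+: substring match
    if len(q) > 1 and q in t:
        return True
    # len 3+: subsequence match (characters of q appear in t, in order)
    if len(q) > 2:
        it = iter(t)
        if all(ch in it for ch in q):
            return True
    # len 4+: similarity ratio (assumed form of the SequenceMatcher stage)
    if len(q) > 3:
        return SequenceMatcher(None, q, t).ratio() >= threshold
    return False
```

   With this gating, `tiered_match("a", "catalog")` is `False` because a one-letter query only gets exact or prefix matching, while `tiered_match("at", "catalog")` is `True` through the substring tier.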
   
   When I tested this earlier with my setup, allowing similarity search at 
len 3 was too noisy. Thus, I added the subsequence match here instead. But it is not 
really necessary if slightly noisy output is acceptable.
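
   For reference, a minimal standalone sketch of the subsequence check used at this stage. `is_subsequence` is a hypothetical helper name and `catalog` is just an example target. The idiom works because a membership test on an iterator consumes it up to the first hit.

```python
def is_subsequence(query: str, target: str) -> bool:
    """Return True if the characters of query appear in target in order."""
    it = iter(target)
    # `ch in it` advances the iterator until it finds ch (or exhausts it),
    # so each later character is searched only in the remaining tail of target.
    return all(ch in it for ch in query)


# Three-character queries against a hypothetical catalog name
print(is_subsequence("ctl", "catalog"))  # True: c, t, l appear in that order
print(is_subsequence("ltc", "catalog"))  # False: same letters, wrong order
```

   This is stricter than a similarity ratio in one respect: every character of the query must be present, and in order.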


