dimas-b commented on code in PR #4075:
URL: https://github.com/apache/polaris/pull/4075#discussion_r3024674127
##########
client/python/apache_polaris/cli/command/utils.py:
##########
@@ -64,3 +69,140 @@ def format_timestamp(ms_since_epoch: int) -> str:
ms_since_epoch / 1000, tz=datetime.timezone.utc
)
return dt.strftime("%Y-%m-%d %H:%M:%S UTC")
+
+
+def is_fuzzy_match(query: str, target: str, threshold: float = 0.85) -> bool:
+ """
+ Determine if a query matches a target using multi-stage fuzzy strategies and case-insensitive.
+ """
+ if not query:
+ return False
+ q = query.lower()
+ t = target.lower()
+ query_len = len(q)
+ # Exact match
+ if q == t:
+ return True
+ # Prefix match
+ if t.startswith(q):
+ return True
+ # Substring match: enabled for length > 1
+ if query_len > 1 and q in t:
+ return True
+ # Subsequence match: enabled for length > 2
+ if query_len > 2:
+ iterator = iter(t)
+ if all(char in iterator for char in q):
Review Comment:
In my personal opinion, matching `max` to `mixed bag of exceptions` (which the
subsequence rule allows) is noise too :sweat_smile: TBH, I do not see the logic
behind this rule :sweat_smile:
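
To make the noise concrete, here is a minimal standalone sketch of the subsequence check from the diff. The helper name `is_subsequence` is made up for illustration and is not part of the PR:

```python
def is_subsequence(query: str, target: str) -> bool:
    """Return True if the characters of `query` appear in `target` in order."""
    remaining = iter(target)
    # `char in remaining` consumes the iterator up to the first hit, so each
    # lookup resumes where the previous one stopped. Gaps of any size between
    # the matched characters are allowed.
    return all(char in remaining for char in query)


# 'm' from "mixed", 'a' from "bag", 'x' from "exceptions"
print(is_subsequence("max", "mixed bag of exceptions"))  # True
```

Because the gaps are unbounded, a short query matches almost any sufficiently long target.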
I'd fall back to `SequenceMatcher` as soon as the exact and substring checks do
not yield `True`, but with different thresholds depending on the query string
length, to reduce noise.
However, like I said, I do not feel strongly about this.
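
A rough sketch of what I mean, in case it helps. The cutoffs in `threshold_for` are placeholders I picked for illustration, not tuned values, and `threshold_for` itself is a hypothetical helper:

```python
from difflib import SequenceMatcher


def threshold_for(query_len: int) -> float:
    """Hypothetical cutoffs: the shorter the query, the stricter the match."""
    if query_len <= 3:
        return 0.9
    if query_len <= 6:
        return 0.8
    return 0.7


def is_fuzzy_match(query: str, target: str) -> bool:
    if not query:
        return False
    q = query.lower()
    t = target.lower()
    # Exact and prefix matches, plus substring matches for length > 1,
    # as in the current diff.
    if t.startswith(q) or (len(q) > 1 and q in t):
        return True
    # No subsequence stage: go straight to the similarity ratio.
    ratio = SequenceMatcher(None, q, t).ratio()
    return ratio >= threshold_for(len(q))
```

With this shape, `max` no longer matches `mixed bag of exceptions`: at most 3 characters can match out of 26 total, so the ratio stays far below the short-query cutoff. A typo such as `catalgo` against `catalog` would still pass under these assumed thresholds.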