Re: [PR] perf(date_parser): bound regex quantifiers for deterministic parsing performance [superset]

via GitHub Thu, 08 Jan 2026 06:19:12 -0800


codeant-ai-for-open-source[bot] commented on code in PR #36983:
URL: https://github.com/apache/superset/pull/36983#discussion_r2672543444



##########
tests/unit_tests/utils/date_parser_tests.py:
##########
@@ -611,3 +611,48 @@ def test_date_range_migration() -> None:
 
     field = "10 years ago"
     assert not re.search(DateRangeMigration.x_dateunit, field)
+
+
+# Tests for bounded whitespace regex patterns in time_range_lookup
[email protected](
+    "time_range,should_parse",
+    [
+        ("last 7 days : ", True),
+        ("this week : ", True),
+        ("start of next month : ", True),
+        ("prior quarter : ", True),
+        ("last  7 days : ", True),
+        ("last   7 days : ", True),
+        ("last    7 days : ", True),
+        ("last     7 days : ", True),
+        ("last      7 days : ", False),  # 6 spaces - exceeds bound
+        ("start of     next     month : ", True),  # 5 spaces - valid
+        ("last week : ", True),
+        ("last  week : ", True),
+        ("last     week : ", True),
+        ("next 12 months : ", True),
+        ("next  12  months : ", True),
+        ("next     12     months : ", True),
+        ("last7days : ", False),  # no space after scope - invalid
+        ("last 7days : ", True),  # \s{0,5} allows 0 spaces after number - 
valid
+        ("lastweek : ", False),  # no space after scope - invalid
+        ("last" + " " * 100 + "7 days : ", False),

Review Comment:
   **Suggestion:** Unclear/overly-large test input: the parametrized case using 
string concatenation to build a 100-space string is noisy and hides the intent 
(the test aims to check >5 spaces). Replace the 100-space expression with an 
explicit 6-space example to be precise and maintainable. [possible bug]
   
   **Severity Level:** Critical 🚨
   ```suggestion
           ("last      7 days : ", False),  # 6 spaces - exceeds bound
   ```
   <details>
   <summary><b>Why it matters? ⭐ </b></summary>
   
   Replacing the noisy 100-space concatenation with a concise explicit example 
(e.g. 6 spaces) improves readability and maintainability. There's already an 
explicit 6-space test in the list, so collapsing the redundant long-case to a 
clear representative is sensible and safe.
   </details>
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** tests/unit_tests/utils/date_parser_tests.py
   **Line:** 639:639
   **Comment:**
        *Possible Bug: Unclear/overly-large test input: the parametrized case 
using string concatenation to build a 100-space string is noisy and hides the 
intent (the test aims to check >5 spaces). Replace the 100-space expression 
with an explicit 6-space example to be precise and maintainable.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   ```
   </details>



##########
superset/utils/date_parser.py:
##########
@@ -418,9 +418,9 @@ def get_since_until(  # pylint: 
disable=too-many-arguments,too-many-locals,too-m
     if time_range and separator in time_range:
         time_range_lookup = [
             (
-                r"^(start of|beginning of|end of)\s+"
-                r"(this|last|next|prior)\s+"
-                r"([0-9]+)?\s*"
+                r"^(start of|beginning of|end of)\s{1,5}"
+                r"(this|last|next|prior)\s{1,5}"
+                r"([0-9]+)?\s{0,5}"

Review Comment:
   **Suggestion:** Regression in whitespace handling: the first regex bounds 
whitespace to at most 5 characters (using `\s{1,5}` and `\s{0,5}`), which is 
more restrictive than the prior behavior (`\s+`/`\s*`) and will fail to match 
valid inputs containing larger or variable whitespace sequences; restore the 
more permissive quantifiers to preserve expected matching. [logic error]
   
   **Severity Level:** Minor ⚠️
   ```suggestion
                   r"^(start of|beginning of|end of)\s+"
                   r"(this|last|next|prior)\s+"
                   r"([0-9]+)?\s*"
   ```
   <details>
   <summary><b>Why it matters? ⭐ </b></summary>
   
   Changing \s+ and \s* to bounded \s{1,5}/\s{0,5} is a regression in 
permissiveness:
   inputs with unusual spacing (or tabs) could fail to match. Restoring \s+/\s*
   preserves previous behavior and is a safe correction.
   </details>
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/utils/date_parser.py
   **Line:** 421:423
   **Comment:**
        *Logic Error: Regression in whitespace handling: the first regex bounds 
whitespace to at most 5 characters (using `\s{1,5}` and `\s{0,5}`), which is 
more restrictive than the prior behavior (`\s+`/`\s*`) and will fail to match 
valid inputs containing larger or variable whitespace sequences; restore the 
more permissive quantifiers to preserve expected matching.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   ```
   </details>



##########
superset/utils/date_parser.py:
##########
@@ -431,8 +431,8 @@ def get_since_until(  # pylint: 
disable=too-many-arguments,too-many-locals,too-m
                 ),
             ),
             (
-                r"^(this|last|next|prior)\s+"
-                r"([0-9]+)?\s*"
+                r"^(this|last|next|prior)\s{1,5}"
+                r"([0-9]+)?\s{0,5}"
                 r"(second|minute|day|week|month|quarter|year)s?$",  # Matches 
"next 5 days" or "last 2 weeks" # noqa: E501

Review Comment:
   **Suggestion:** Logic bug: the second regex pattern that matches scopes like 
"next 5 hours" omits the `hour` unit, so expressions containing "hour"/"hours" 
will not match and will fall back to the default parsing path, causing 
incorrect parsing or failure for time ranges expressed in hours. [logic error]
   
   **Severity Level:** Minor ⚠️
   ```suggestion
                   r"(second|minute|hour|day|week|month|quarter|year)s?$",  # 
Matches "next 5 days" or "last 2 weeks" # noqa: E501
   ```
   <details>
   <summary><b>Why it matters? ⭐ </b></summary>
   
   This is a correct and actionable bug report: the current regex omits 
"hour"/"hours"
   so inputs like "last 3 hours" won't match this branch and will fall back to 
less
   appropriate parsing. Adding "hour" to the alternation fixes the logic without
   side effects.
   </details>
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/utils/date_parser.py
   **Line:** 436:436
   **Comment:**
        *Logic Error: Logic bug: the second regex pattern that matches scopes 
like "next 5 hours" omits the `hour` unit, so expressions containing 
"hour"/"hours" will not match and will fall back to the default parsing path, 
causing incorrect parsing or failure for time ranges expressed in hours.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   ```
   </details>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] perf(date_parser): bound regex quantifiers for deterministic parsing performance [superset]

Reply via email to