codeant-ai-for-open-source[bot] commented on code in PR #40677:
URL: https://github.com/apache/superset/pull/40677#discussion_r3339159928


##########
superset/result_set.py:
##########
@@ -184,6 +186,23 @@ def __init__(  # pylint: disable=too-many-locals  # noqa: 
C901
                 TypeError,  # this is super hackey,
                 # https://issues.apache.org/jira/browse/ARROW-7855
             ):
+                col_values = array[column].tolist()
+                if 
is_feature_enabled("PRESERVE_NUMERIC_COLUMNS_FOR_SPECIAL_FLOATS"):
+                    col_values = [
+                        None if isinstance(v, str) and v in _FLOAT_SPECIAL 
else v
+                        for v in col_values
+                    ]
+                    try:
+                        pa_data.append(pa.array(col_values))
+                        continue

Review Comment:
   **Suggestion:** The special-float cleanup is applied globally to every 
engine whenever the feature flag is enabled, but this workaround is only valid 
for Druid/pydruid behavior. As written, enabling the flag for Druid can 
silently rewrite legitimate `"NaN"`/`"Infinity"` string values from other 
engines into nulls and change inferred Arrow types. Gate this branch by engine 
(for example, only when `db_engine_spec.engine == "druid"`) so non-Druid result 
sets are not mutated. [logic error]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ❌ Non-Druid queries mis-serialize "NaN"/Infinity strings as nulls.
   - ⚠️ SQL Lab and dashboards see unexpected numeric column types.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Enable the global feature flag so that
   
`superset.result_set.is_feature_enabled("PRESERVE_NUMERIC_COLUMNS_FOR_SPECIAL_FLOATS")`
   returns True; this is the condition checked in the Arrow error path at
   `superset/result_set.py:189-197`.
   
   2. Execute a SQL statement against a non-Druid database (for example, a 
Postgres
   `Database` whose engine spec is `PostgresEngineSpec` from
   `superset/db_engine_specs/postgres.py`) through SQL Lab; this flows through
   `SQLExecutor.execute()` to `execute_sql_with_cursor()` in
   `superset/sql/execution/executor.py:17-25`.
   
   3. In `execute_sql_with_cursor()`, when the cursor has a description, the 
code at
   `superset/sql/execution/executor.py:79-87` constructs a 
`SupersetResultSet(rows,
   description, database.db_engine_spec)`, passing the non-Druid 
`db_engine_spec` into
   `SupersetResultSet.__init__` in `superset/result_set.py:135-143`.
   
   4. If one of the result columns has mixed Python values like `["NaN", 1.5, 
"Infinity",
   2.3, "-Infinity", None]` (the same pattern used in
   `tests/unit_tests/result_set_test.py:45-58`), 
`pa.array(array[column].tolist())` at
   `superset/result_set.py:178-181` raises `ArrowInvalid` and falls into the 
exception block
   where `col_values` is set and, because the flag is enabled, the branch at
   `superset/result_set.py:189-197` replaces any string `"NaN"`, `"Infinity"`, 
or
   `"-Infinity"` with `None` for this non-Druid engine before re-running
   `pa.array(col_values)`, silently converting legitimate string values to 
nulls and changing
   the inferred Arrow type for that column.
   ```
   </details>
   
   [Fix in 
Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=0e2b232739c34c809da0cc1b673c6ce6&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 | [Fix in VSCode 
Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=0e2b232739c34c809da0cc1b673c6ce6&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/result_set.py
   **Line:** 190:197
   **Comment:**
        *Logic Error: The special-float cleanup is applied globally to every 
engine whenever the feature flag is enabled, but this workaround is only valid 
for Druid/pydruid behavior. As written, enabling the flag for Druid can 
silently rewrite legitimate `"NaN"`/`"Infinity"` string values from other 
engines into nulls and change inferred Arrow types. Gate this branch by engine 
(for example, only when `db_engine_spec.engine == "druid"`) so non-Druid result 
sets are not mutated.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40677&comment_hash=c69f351253e32e792e0a870b3116fcb7201dc89e18fed4cd0eeef50a9589065a&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40677&comment_hash=c69f351253e32e792e0a870b3116fcb7201dc89e18fed4cd0eeef50a9589065a&reaction=dislike'>👎</a>



##########
superset/result_set.py:
##########
@@ -277,13 +296,26 @@ def type_generic(
 
     def data_type(self, col_name: str, pa_dtype: pa.DataType) -> Optional[str]:
         """Given a pyarrow data type, Returns a generic database type"""
-        if set_type := self._type_dict.get(col_name):
+        set_type = self._type_dict.get(col_name)
+        pa_mapped = self.convert_pa_dtype(pa_dtype)
+
+        # pydruid infers column types from the first row value, so a None or
+        # special-float-string first value causes the column to be labelled
+        # STRING even when the actual data is numeric.  When the feature flag
+        # is enabled, prefer PyArrow's inferred type over a cursor-description
+        # STRING so that numeric columns are not misreported.
+        if (
+            is_feature_enabled("PRESERVE_NUMERIC_COLUMNS_FOR_SPECIAL_FLOATS")
+            and set_type == "STRING"
+            and pa_mapped is not None
+            and pa_mapped != "STRING"
+        ):
+            return pa_mapped
+

Review Comment:
   **Suggestion:** The metadata override in `data_type()` also runs for all 
engines, so any backend that reports `STRING` in cursor metadata but returns 
typed Python values can be reclassified to numeric/datetime unexpectedly when 
the flag is enabled. This breaks type-contract consistency outside Druid. 
Restrict this override to Druid result sets (or another explicit pydruid-only 
condition) before preferring PyArrow type inference. [api mismatch]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ❌ Non-Druid engines have metadata STRING columns retyped as numeric.
   - ⚠️ SQL Lab column metadata inconsistent with database cursor description.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. With the feature flag enabled so that
   `is_feature_enabled("PRESERVE_NUMERIC_COLUMNS_FOR_SPECIAL_FLOATS")` is True, 
construct a
   `SupersetResultSet` as done in
   
`tests/unit_tests/result_set_test.py:test_druid_none_first_value_reports_numeric_type_with_flag_enabled`
   (lines 87-104): pass `data = [(None,), (1.5,), (2.3,), (None,), (4.7,)]`, a 
cursor
   `description = [("metric", "STRING", None, None, None, None, None)]`, and a 
non-Druid
   engine spec such as `PostgresEngineSpec` from 
`superset/db_engine_specs/postgres.py` into
   `SupersetResultSet(data, description, PostgresEngineSpec)`.
   
   2. Inside `SupersetResultSet.__init__` (`superset/result_set.py:135-251`), 
`_type_dict` is
   populated at `superset/result_set.py:243-250` using
   `db_engine_spec.get_datatype(deduped_cursor_desc[i][1])`, so for this column 
`set_type` in
   `data_type()` becomes `"STRING"` based solely on the cursor metadata from 
the non-Druid
   engine.
   
   3. When SQL Lab or the async executor needs column metadata, 
`SupersetResultSet.columns`
   at `superset/result_set.py:73-88` iterates over the Arrow schema and calls
   `self.data_type(col.name, col.type)`; `data_type()` (lines 38-59) computes 
`pa_mapped =
   self.convert_pa_dtype(pa_dtype)`, which returns `"FLOAT"` for the `float64` 
Arrow type
   inferred from the actual data.
   
   4. Because the override condition at `superset/result_set.py:48-54` 
currently checks only
   the feature flag and `set_type == "STRING"` (without verifying that
   `self.db_engine_spec.engine` is `"druid"`), the branch returns `pa_mapped` 
(`"FLOAT"`) for
   this non-Druid engine, so `_serialize_result_set()` in
   `superset/sql/execution/celery_task.py:111-122` returns `columns = 
result_set.columns`
   with a `"FLOAT"` type even though the engine's cursor description declared 
`"STRING"`,
   causing type metadata for non-Druid backends to be reclassified unexpectedly 
whenever this
   flag is on.
   ```
   </details>
   
   [Fix in 
Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=d654c945036941bd97cb24caefd2f5a7&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 | [Fix in VSCode 
Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=d654c945036941bd97cb24caefd2f5a7&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/result_set.py
   **Line:** 308:314
   **Comment:**
        *Api Mismatch: The metadata override in `data_type()` also runs for all 
engines, so any backend that reports `STRING` in cursor metadata but returns 
typed Python values can be reclassified to numeric/datetime unexpectedly when 
the flag is enabled. This breaks type-contract consistency outside Druid. 
Restrict this override to Druid result sets (or another explicit pydruid-only 
condition) before preferring PyArrow type inference.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40677&comment_hash=96303c073287ecf142a80b2093d0e10558a4d47dc845f5f4af39b3ead73b956e&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40677&comment_hash=96303c073287ecf142a80b2093d0e10558a4d47dc845f5f4af39b3ead73b956e&reaction=dislike'>👎</a>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to