Re: [PR] feat(datasets): ai assistant for queries [superset]

via GitHub Wed, 19 Nov 2025 05:55:29 -0800


bito-code-review[bot] commented on code in PR #36187:
URL: https://github.com/apache/superset/pull/36187#discussion_r2542115030



##########
superset/aiassistant/sql_generator.py:
##########
@@ -0,0 +1,154 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""AI-powered SQL generation service using HuggingFace models."""
+
+import logging
+from typing import Any
+
+from flask import current_app
+
+logger = logging.getLogger(__name__)
+
+
+class SQLGeneratorService:
+    """Service for generating SQL from natural language using AI models."""
+
+    @staticmethod
+    def get_api_token() -> str | None:
+        """
+        Get HuggingFace API token from Superset config.
+
+        Returns:
+            API token string or None if not configured
+        """
+        return current_app.config.get("HF_API_TOKEN") or 
current_app.config.get(
+            "HF_TOKEN"
+        )
+
+    @staticmethod
+    def get_model() -> str:
+        """
+        Get HuggingFace model from Superset config.
+
+        Returns:
+            Model name string, defaults to Qwen/Qwen2.5-Coder-32B-Instruct
+        """
+        return current_app.config.get("HF_MODEL") or 
"Qwen/Qwen2.5-Coder-32B-Instruct"
+
+    @staticmethod
+    def generate_sql(
+        user_query: str, schema_info: str, db_dialect: str = "SQL"
+    ) -> dict[str, Any]:
+        """
+        Generate SQL from natural language query.
+
+        Args:
+            user_query: Natural language question from the user
+            schema_info: Schema information about the dataset
+            db_dialect: Database dialect (e.g., "PostgreSQL", "MySQL", 
"SQLite")
+
+        Returns:
+            Dictionary with 'sql' and optional 'error' keys
+        """
+        try:
+            from huggingface_hub import InferenceClient
+        except ImportError:
+            logger.error("huggingface_hub package is not installed")
+            return {
+                "error": (
+                    "AI SQL generation is not available. "
+                    "Missing huggingface_hub package."
+                )
+            }
+
+        api_token = SQLGeneratorService.get_api_token()
+
+        if not api_token:
+            logger.warning(
+                "No HuggingFace API token configured. "
+                "Set HF_API_TOKEN in superset_config.py"
+            )
+            return {
+                "error": (
+                    "AI SQL generation is not configured. "
+                    "Please set HF_API_TOKEN in superset_config.py"
+                )
+            }
+
+        try:
+            # Initialize the inference client
+            client = InferenceClient(token=api_token)
+
+            # Construct the messages for chat completion
+            system_prompt = (
+                f"{schema_info}\n\nIMPORTANT: Generate {db_dialect} dialect 
SQL. "
+                f"Use {db_dialect}-specific syntax and functions."
+            )
+            messages = [
+                {"role": "system", "content": system_prompt},
+                {
+                    "role": "user",
+                    "content": (
+                        f"Convert this request into {db_dialect} SQL: 
{user_query}"
+                    ),
+                },
+            ]
+
+            logger.info(
+                "Generating SQL for query: %s...", user_query[:100]
+            )  # Log first 100 chars
+
+            # Get the model to use
+            model = SQLGeneratorService.get_model()
+            logger.info("Using HuggingFace model: %s", model)
+
+            # Call the model using chat completion
+            response = client.chat_completion(
+                messages=messages,
+                model=model,
+                max_tokens=500,
+                temperature=0.1,  # Low temperature for deterministic output
+            )
+
+            # Extract the SQL query
+            sql_query = response.choices[0].message.content.strip()

Review Comment:
   
   <div>
   
   
   <div id="suggestion">
   <div id="issue"><b>Missing response validation in AI API call</b></div>
   <div id="fix">
   
   The code directly accesses `response.choices[0].message.content` without 
verifying that the `choices` list exists and contains at least one element. If 
the HuggingFace API returns a response without choices, this will raise an 
`IndexError`, causing the SQL generation to fail unexpectedly. In the 
`generate_sql` method, the API endpoints in `superset/aiassistant/api.py` and 
`superset/views/aiassistant.py` that call this function will receive an 
unhandled exception instead of a proper error response. To fix this, add a 
check after the API call to ensure `response.choices` is not empty before 
proceeding.
   </div>
   
   <details>
   <summary>
   <b>Code suggestion</b>
   </summary>
   <blockquote>Check the AI-generated fix before applying</blockquote>
   <div id="code">
   
   
   ````suggestion
               )
   
               # Validate response structure
               if not response.choices or len(response.choices) == 0:
                   logger.error("Invalid response from AI model: no choices 
returned")
                   return {"error": "AI model returned invalid response"}
   
               # Extract the SQL query
               sql_query = response.choices[0].message.content.strip()
   ````
   
   </div>
   </details>
   
   
   </div>
   
   
   
   
   <small><i>Code Review Run <a 
href=https://github.com/apache/superset/pull/36187#issuecomment-3552782654>#565ab4</a></i></small>
   </div>
   
   ---
   Should Bito avoid suggestions like this for future reviews? (<a 
href=https://alpha.bito.ai/home/ai-agents/review-rules>Manage Rules</a>)
   - [ ] Yes, avoid them



##########
superset/views/aiassistant.py:
##########
@@ -0,0 +1,90 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import logging
+from typing import Any
+
+from flask import request
+from flask_appbuilder import permission_name
+from flask_appbuilder.api import expose
+from flask_appbuilder.security.decorators import has_access
+
+from superset import event_logger
+from superset.aiassistant.sql_generator import SQLGeneratorService
+from superset.constants import MODEL_API_RW_METHOD_PERMISSION_MAP
+from superset.superset_typing import FlaskResponse
+
+from .base import BaseSupersetView
+
+logger = logging.getLogger(__name__)
+
+
+class AIAssistantView(BaseSupersetView):
+    """View for the AI Assistant page."""
+
+    route_base = "/aiassistant"
+    class_permission_name = "AIAssistant"
+
+    method_permission_name = MODEL_API_RW_METHOD_PERMISSION_MAP
+
+    @expose("/", methods=["GET"])
+    @has_access
+    @permission_name("read")
+    @event_logger.log_this
+    def root(self, **kwargs: Any) -> FlaskResponse:
+        """Handles the AI Assistant page."""
+        return self.render_app_template()
+
+    @expose("/generate_sql", methods=["POST"])
+    @has_access
+    @permission_name("read")
+    @event_logger.log_this
+    def generate_sql(self) -> FlaskResponse:
+        """
+        Generate SQL from natural language query using AI.
+
+        Expected JSON payload:
+        {
+            "user_query": "Show me all active users",
+            "schema_info": "The dataset 'users' has columns: id (INTEGER), ..."
+        }
+
+        Returns:
+            JSON response with 'sql' or 'error' key
+        """
+        try:
+            data = request.get_json(force=True)
+            user_query = data.get("user_query")
+            schema_info = data.get("schema_info")
+
+            if not user_query or not schema_info:
+                return self.json_response(
+                    {"error": "Missing required fields: user_query and 
schema_info"},
+                    status=400,
+                )
+
+            # Generate SQL using the AI service
+            result = SQLGeneratorService.generate_sql(user_query, schema_info)
+
+            if "error" in result:
+                logger.warning("AI SQL generation failed")
+                return self.json_response(result, status=400)
+
+            return self.json_response(result)
+
+        except Exception as e:
+            logger.exception("Error in AI SQL generation endpoint")
+            return self.json_response({"error": str(e)}, status=500)

Review Comment:
   
   <div>
   
   
   <div id="suggestion">
   <div id="issue"><b>Insecure exception handling in generate_sql</b></div>
   <div id="fix">
   
   The exception handler in the `generate_sql` method returns the full 
exception message `str(e)` to the client, which could expose sensitive internal 
details like database connection strings or stack traces in production. While 
the `logger.exception` call properly logs the full error for debugging, the 
client response should use a generic error message to prevent information 
disclosure. This affects the `generate_sql` endpoint and any downstream 
consumers of the AI SQL generation feature.
   </div>
   
   <details>
   <summary>
   <b>Code suggestion</b>
   </summary>
   <blockquote>Check the AI-generated fix before applying</blockquote>
   <div id="code">
   
   
   ````suggestion
           except Exception as e:
               logger.exception("Error in AI SQL generation endpoint")
               return self.json_response({"error": "Internal server error"}, 
status=500)
   ````
   
   </div>
   </details>
   
   
   </div>
   
   
   
   
   <small><i>Code Review Run <a 
href=https://github.com/apache/superset/pull/36187#issuecomment-3552782654>#565ab4</a></i></small>
   </div>
   
   ---
   Should Bito avoid suggestions like this for future reviews? (<a 
href=https://alpha.bito.ai/home/ai-agents/review-rules>Manage Rules</a>)
   - [ ] Yes, avoid them



##########
superset-frontend/src/pages/AIAssistant/index.tsx:
##########
@@ -0,0 +1,714 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+import { useEffect, useState } from 'react';
+import { SupersetClient, t } from '@superset-ui/core';
+import { styled, useTheme } from '@apache-superset/core/ui';
+import {
+  Row,
+  Col,
+  Card,
+  Input,
+  Select,
+  Button,
+} from '@superset-ui/core/components';
+
+const { TextArea } = Input;
+
+const PageContainer = styled.div`
+  padding: 48px;
+  max-width: 1400px;
+  margin: 0 auto;
+`;
+
+const PageTitle = styled.h1`
+  font-size: 32px;
+  font-weight: 600;
+  margin-bottom: 48px;
+  display: flex;
+  align-items: center;
+  gap: 16px;
+`;
+
+const StyledCard = styled(Card)`
+  margin-bottom: 32px;
+  box-shadow: 0 2px 8px rgba(0, 0, 0, 0.08);
+`;
+
+const ChatContainer = styled.div`
+  ${({ theme }) => `
+    min-height: 400px;
+    max-height: 400px;
+    overflow-y: auto;
+    padding: 32px;
+    border: 1px solid ${theme.colorBorder};
+    border-radius: 8px;
+    background-color: ${theme.colorBgLayout};
+    margin-bottom: 32px;
+  `}
+`;
+
+const ChatMessage = styled.div<{ isUser?: boolean }>`
+  ${({ theme, isUser }) => `
+    padding: 24px;
+    margin-bottom: 16px;
+    border-radius: 8px;
+    background-color: ${isUser ? theme.colorInfoBg : theme.colorBgContainer};
+    max-width: 80%;
+    ${isUser ? 'margin-left: auto;' : 'margin-right: auto;'}
+    white-space: pre-wrap;
+    word-wrap: break-word;
+
+    pre {
+      background-color: ${theme.colorBgLayout};
+      padding: 12px;
+      border-radius: 4px;
+      overflow-x: auto;
+      margin: 8px 0;
+    }
+
+    code {
+      font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
+      font-size: 13px;
+    }
+  `}
+`;
+
+const LoadingMessage = styled.div`
+  ${({ theme }) => `
+    padding: 24px;
+    margin-bottom: 16px;
+    border-radius: 8px;
+    background-color: ${theme.colorBgContainer};
+    max-width: 80%;
+    margin-right: auto;
+    color: ${theme.colorTextSecondary};
+    font-style: italic;
+  `}
+`;
+
+const DatasetSelectorContainer = styled.div`
+  margin-bottom: 32px;
+`;
+
+const StyledTextArea = styled(TextArea)`
+  margin-bottom: 16px;
+`;
+
+const DataPreviewContainer = styled.div`
+  ${({ theme }) => `
+    max-height: 500px;
+    overflow: auto;
+    border: 1px solid ${theme.colorBorder};
+    border-radius: 8px;
+    margin-top: 16px;
+  `}
+`;
+
+const ColumnTable = styled.table`
+  ${({ theme }) => `
+    width: 100%;
+    border-collapse: collapse;
+
+    th,
+    td {
+      padding: 12px;
+      text-align: left;
+      border-bottom: 1px solid ${theme.colorBorder};
+      white-space: nowrap;
+    }
+
+    th {
+      font-weight: 600;
+      background-color: ${theme.colorFillQuaternary};
+      position: sticky;
+      top: 0;
+      z-index: 1;
+    }
+
+    tr:hover {
+      background-color: ${theme.colorBgLayout};
+    }
+  `}
+`;
+
+const SQLCodeBlock = styled.pre`
+  ${({ theme }) => `
+    background-color: ${theme.colorBgLayout};
+    padding: 16px;
+    border-radius: 4px;
+    overflow-x: auto;
+    margin: 8px 0;
+    border: 1px solid ${theme.colorBorder};
+
+    code {
+      font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
+      font-size: 13px;
+      line-height: 1.5;
+      color: ${theme.colorText};
+    }
+  `}
+`;
+
+interface Dataset {
+  id: number;
+  table_name: string;
+  database_name: string;
+  schema?: string;
+}
+
+interface ChatMessageType {
+  id: string;
+  text: string;
+  isUser: boolean;
+  timestamp: Date;
+  sql?: string; // Optional SQL code to display
+  queryResults?: {
+    columns: string[];
+    data: Array<Record<string, any>>;
+    rowCount: number;
+  };
+  isExecuting?: boolean;
+}
+
+interface DatasetData {
+  columns: Array<{ name: string; type: string; description?: string }>;
+  data: Array<Record<string, any>>;
+}
+
+export default function AIAssistant() {
+  const theme = useTheme();
+  const [datasets, setDatasets] = useState<Dataset[]>([]);
+  const [selectedDataset, setSelectedDataset] = useState<number | null>(null);
+  const [datasetData, setDatasetData] = useState<DatasetData | null>(null);
+  const [loadingDatasets, setLoadingDatasets] = useState(false);
+  const [loadingData, setLoadingData] = useState(false);
+  const [loadingAI, setLoadingAI] = useState(false);
+  const [messages, setMessages] = useState<ChatMessageType[]>([]);
+  const [inputValue, setInputValue] = useState('');
+
+  // Fetch datasets on mount
+  useEffect(() => {
+    fetchDatasets();
+  }, []);
+
+  // Fetch dataset data when selection changes
+  useEffect(() => {
+    if (selectedDataset) {
+      fetchDatasetData(selectedDataset);
+    } else {
+      setDatasetData(null);
+    }
+  }, [selectedDataset]);
+
+  const fetchDatasets = async () => {
+    setLoadingDatasets(true);
+    try {
+      const response = await SupersetClient.get({
+        endpoint: '/api/v1/dataset/',
+        searchParams: {
+          q: JSON.stringify({
+            page_size: 100,
+            order_column: 'table_name',
+            order_direction: 'asc',
+          }),
+        },
+      });
+
+      const datasetsData = response.json.result.map((ds: any) => ({
+        id: ds.id,
+        table_name: ds.table_name,
+        database_name: ds.database?.database_name || 'Unknown',
+        schema: ds.schema,
+      }));
+
+      setDatasets(datasetsData);
+    } catch (error) {
+      console.error('Error fetching datasets:', error);
+    } finally {
+      setLoadingDatasets(false);
+    }
+  };
+
+  const fetchDatasetData = async (datasetId: number) => {
+    setLoadingData(true);
+    try {
+      // First, fetch dataset metadata to get column information
+      const datasetResponse = await SupersetClient.get({
+        endpoint: `/api/v1/dataset/${datasetId}`,
+      });
+
+      const dataset = datasetResponse.json.result;
+
+      // Get column information including descriptions
+      const columns =
+        dataset.columns?.map((col: any) => ({
+          name: col.column_name,
+          type: col.type,
+          description: col.description,
+        })) || [];
+
+      // Log column info for debugging
+      console.log('Dataset columns with descriptions:', columns);
+
+      // Now fetch sample data using the chart data API
+      const queryPayload = {
+        datasource: {
+          id: datasetId,
+          type: 'table',
+        },
+        queries: [
+          {
+            columns: columns.map((col: { name: string }) => col.name),
+            row_limit: 50,
+            orderby: [],
+          },
+        ],
+        result_format: 'json',
+        result_type: 'full',
+      };
+
+      const dataResponse = await SupersetClient.post({
+        endpoint: '/api/v1/chart/data',
+        jsonPayload: queryPayload,
+      });
+
+      // Extract the data from the response
+      const result = dataResponse.json.result?.[0];
+      const data = result?.data || [];
+
+      setDatasetData({
+        columns,
+        data,
+      });
+    } catch (error) {
+      console.error('Error fetching dataset data:', error);
+      // Still show columns even if data fetch fails
+      const datasetResponse = await SupersetClient.get({
+        endpoint: `/api/v1/dataset/${datasetId}`,
+      });
+      const dataset = datasetResponse.json.result;
+      const columns =
+        dataset.columns?.map((col: any) => ({
+          name: col.column_name,
+          type: col.type,
+          description: col.description,
+        })) || [];
+      setDatasetData({
+        columns,
+        data: [],
+      });
+    } finally {
+      setLoadingData(false);
+    }
+  };
+
+  // Build schema information for AI context
+  const buildSchemaInfo = (): string => {
+    if (!selectedDataset || !datasetData) return '';
+
+    const datasetName =
+      datasets.find(d => d.id === selectedDataset)?.table_name || 'Unknown';
+    let schemaInfo = `The dataset "${datasetName}" has the following 
columns:\n\n`;
+
+    datasetData.columns.forEach(col => {
+      schemaInfo += `- ${col.name} (${col.type})`;
+      if (col.description) {
+        schemaInfo += `: ${col.description}`;
+      }
+      schemaInfo += '\n';
+    });
+
+    schemaInfo +=
+      '\nGenerate a valid SQL SELECT query using ONLY these column names.\n';
+    schemaInfo +=
+      'DO NOT explain your reasoning, and DO NOT return anything other than 
the SQL query itself.';
+
+    return schemaInfo;
+  };
+
+  const handleSendMessage = async () => {
+    if (!inputValue.trim()) return;
+
+    if (!selectedDataset) {
+      // Show error if no dataset selected
+      const errorMessage: ChatMessageType = {
+        id: Date.now().toString(),
+        text: 'Please select a dataset first before asking questions.',
+        isUser: false,
+        timestamp: new Date(),
+      };
+      setMessages(prev => [...prev, errorMessage]);
+      return;
+    }
+
+    const newMessage: ChatMessageType = {
+      id: Date.now().toString(),
+      text: inputValue,
+      isUser: true,
+      timestamp: new Date(),
+    };
+
+    setMessages(prev => [...prev, newMessage]);
+    setInputValue('');
+    setLoadingAI(true);
+
+    try {
+      // Call the AI SQL generation API
+      const response = await SupersetClient.post({
+        endpoint: '/api/v1/aiassistant/generate_sql',
+        jsonPayload: {
+          dataset_id: selectedDataset,
+          user_query: inputValue,
+        },

Review Comment:
   
   <div>
   
   
   <div id="suggestion">
   <div id="issue"><b>Incomplete API payload for SQL generation</b></div>
   <div id="fix">
   
   The API call to generate SQL is missing the required 'schema_info' parameter 
that the backend expects. This will cause the request to fail with a 400 error 
indicating missing required fields. The backend endpoint requires both 
'user_query' and 'schema_info' to generate SQL, but the frontend is only 
sending 'dataset_id' and 'user_query'. Add the schema_info built from the 
selected dataset's metadata to the jsonPayload.
   </div>
   
   <details>
   <summary>
   <b>Code suggestion</b>
   </summary>
   <blockquote>Check the AI-generated fix before applying</blockquote>
   <div id="code">
   
   
   ````suggestion
           endpoint: '/api/v1/aiassistant/generate_sql',
           jsonPayload: {
             dataset_id: selectedDataset,
             user_query: inputValue,
             schema_info: buildSchemaInfo(),
           },
   ````
   
   </div>
   </details>
   
   
   </div>
   
   
   
   
   <small><i>Code Review Run <a 
href=https://github.com/apache/superset/pull/36187#issuecomment-3552782654>#565ab4</a></i></small>
   </div>
   
   ---
   Should Bito avoid suggestions like this for future reviews? (<a 
href=https://alpha.bito.ai/home/ai-agents/review-rules>Manage Rules</a>)
   - [ ] Yes, avoid them



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat(datasets): ai assistant for queries [superset]

Reply via email to