ottensjors commented on issue #30208: URL: https://github.com/apache/superset/issues/30208#issuecomment-4563797502
Hi all, I ran into what appears to be the same issue when connecting Apache Superset to Spark Thrift Server through a `hive://` URI, using the Hive database connection in Superset. In my setup, normal SQL queries work correctly, but the SQL Lab left-hand metadata browser does not reflect schemas, tables and columns correctly for Spark/Delta tables. The error in this issue: ```sql DESCRIBE srdm_edm_sit.srdm_edm_sit ``` matches one of the root cause patterns I found. Superset/PyHive can end up treating the selected schema as both the schema and the table name, which leads to invalid Spark SQL metadata queries. I tested a local patch in a custom Superset Docker image and the SQL Lab sidebar started working correctly afterwards. ## Environment - Superset: `apache/superset:latest` at the time of testing, reporting Superset `6.1.0` - Connection URI: `hive://spark-thrift-server:10000/default` - Backend: Spark Thrift Server - Spark: `3.5.3` - Delta Lake: `3.2.1` - Hive Metastore: external Hive Metastore `3.1.0` - Object storage: MinIO / S3-compatible storage - Driver path: Superset -> SQLAlchemy -> PyHive -> Spark Thrift Server -> Hive Metastore / MinIO ## Observed behavior When browsing schemas and tables in SQL Lab, Superset can generate incorrect metadata queries. For example, when selecting a schema, Superset may generate something like: ```sql DESCRIBE <schema>.<schema> ``` instead of: ```sql DESCRIBE <schema>.<table> ``` In my case this resulted in one or more of the following: - schema names shown as table names - repeated schema names in the left-hand SQL Lab sidebar - incorrect or missing table metadata - errors when opening column metadata for Delta tables - partition metadata errors for Delta tables Normal SQL queries against the same tables work correctly. The issue appears to be in metadata reflection, not in Spark query execution. ## Expected behavior When using Spark Thrift Server through `hive://`, SQL Lab should: - show the actual schemas - show the actual tables/views inside the selected schema - show the correct columns for Spark/Delta tables - avoid generating invalid `DESCRIBE <schema>.<schema>` metadata queries - avoid running Hive-specific metadata commands that Spark/Delta does not support ## Screenshots ### 1. Spark Thrift Server connection used in Superset This is the Superset database connection used to reproduce the issue. ```text hive://spark-thrift-server:10000/default ``` <img width="496" height="851" alt="Superset Spark Thrift Server connection using hive URI" src="https://github.com/user-attachments/assets/a08a110f-ed78-4bfd-9ba9-b08a846fafae" /> ### 2. Before patch: incorrect schema/table metadata in SQL Lab Before applying the patch, the SQL Lab sidebar shows incorrect metadata. In my case, schema names were shown as table names and Superset could generate invalid metadata queries such as: ```sql DESCRIBE <schema>.<schema> ``` <img width="1227" height="723" alt="Superset SQL Lab incorrect Spark Thrift Server metadata before patch" src="https://github.com/user-attachments/assets/c666b5a9-c724-4a48-a4e9-3fbb5c78a596" /> ### 3. After patch: correct schema/table/column metadata in SQL Lab After applying the patch, the SQL Lab sidebar correctly shows the schemas, tables and columns for Spark/Delta tables through Spark Thrift Server. <img width="1227" height="723" alt="Superset SQL Lab correct Spark Thrift Server metadata after patch" src="https://github.com/user-attachments/assets/1da6008e-067f-4552-a261-bb14c7147fbb" /> ## Root causes I found I found multiple related metadata reflection issues. ### 1. Superset Hive engine spec rewrites the connection URI database during schema browsing Superset's Hive engine spec appears to rewrite the connection URI database to the selected schema when browsing metadata. With Spark Thrift Server this can cause Superset to mix up catalog, schema and table names. For example, selecting schema `gold` can lead to metadata queries where `gold` is treated as both the schema and the table name. This matches the kind of query shown in this issue: ```sql DESCRIBE srdm_edm_sit.srdm_edm_sit ``` In my local test, removing this database rewrite prevented Superset from building invalid `DESCRIBE` / `SHOW TABLES` queries. ### 2. PyHive assumes `SHOW TABLES` returns a one-column Hive-style result PyHive's default `HiveDialect.get_table_names()` assumes `SHOW TABLES` returns a single-column result and reads: ```python row[0] ``` However, Spark Thrift Server returns `SHOW TABLES` as: ```text namespace | tableName | isTemporary ``` Example: ```text gold | orders | false gold | customers | false ``` In this result shape: - `row[0]` is the namespace/schema - `row[1]` is the actual table name - `row[2]` indicates whether the object is temporary Because PyHive reads `row[0]`, Superset can display repeated schema names instead of actual table names. A Spark-aware implementation should handle both result shapes: ```python if len(row) == 1: table_name = row[0] elif len(row) >= 3: table_name = row[1] is_temporary = row[2] ``` Temporary objects can optionally be skipped when `isTemporary` is true. ### 3. PyHive column reflection is not defensive enough for Spark/Delta `DESCRIBE` output Superset fetches table metadata through SQLAlchemy's `get_columns()` flow. PyHive implements this by running `DESCRIBE` and parsing the result as if it only contains real table columns. Spark/Delta can return extra metadata sections in `DESCRIBE` output, for example: ```text # Partition Information # col_name # Detailed Table Information ``` These rows are not real table columns. If PyHive tries to parse them as normal columns, the SQL Lab sidebar can fail when opening table metadata. The parser should skip or stop at metadata/header rows such as: ```text # col_name # Partition Information # Detailed Table Information ``` It should also handle malformed or unknown types defensively instead of failing hard. ### 4. Superset's Hive partition lookup can fail for Spark/Delta tables Superset's Hive engine spec can generate Hive-style partition metadata queries such as: ```sql SHOW PARTITIONS <schema>.<table> ``` Spark/Delta can reject this with an error similar to: ```text INVALID_PARTITION_OPERATION.PARTITION_MANAGEMENT_IS_UNSUPPORTED ``` This affects Superset metadata browsing only. It does not affect normal `SELECT` queries or Spark/Delta partition pruning. In my local workaround I disabled this metadata lookup path for Spark/Delta by returning an empty result instead of calling `SHOW PARTITIONS`. A cleaner upstream fix may be to guard this behavior behind a Spark/Delta-specific capability flag instead of changing generic Hive behavior globally. ## Local workaround that fixed the issue I tested four local patches in a custom Superset Docker image: 1. Removed the Hive engine spec schema/database URI rewrite during metadata browsing. 2. Fixed PyHive `SHOW TABLES` parsing for Spark's `namespace | tableName | isTemporary` result shape. 3. Made PyHive `DESCRIBE` parsing skip Spark/Delta metadata rows. 4. Disabled Hive-style `SHOW PARTITIONS` metadata lookup for Spark/Delta tables. After these patches, SQL Lab reflected schemas, tables and columns correctly. This is not necessarily the exact upstream implementation I am proposing, but it confirms the root causes. ### Patch summary #### Patch 1: Prevent Hive engine spec from rewriting the connection URI database Superset's Hive/Spark engine spec rewrites the connection URI database to the selected schema when browsing metadata. With Spark Thrift Server, this can cause Superset to mix up catalog, schema and table names. Removing this rewrite keeps the original connection database intact and prevents Superset from building invalid `DESCRIBE` / `SHOW TABLES` queries. #### Patch 2: Handle Spark Thrift Server `SHOW TABLES` result shape PyHive's default `HiveDialect.get_table_names()` assumes `SHOW TABLES` returns a single-column result and reads `row[0]`. Spark Thrift Server returns: ```text namespace | tableName | isTemporary ``` The workaround keeps compatibility with normal Hive-style one-column results, while correctly handling Spark's three-column `SHOW TABLES` output. #### Patch 3: Make PyHive column reflection safe for Spark/Delta `DESCRIBE` output Spark Thrift Server with Delta tables can return extra metadata sections in `DESCRIBE` output: ```text # Partition Information # col_name # Detailed Table Information ``` The workaround skips empty rows, skips `# col_name` header rows, stops before partition/table metadata sections, and handles unknown or malformed types defensively. #### Patch 4: Disable Hive-style partition lookup for Spark/Delta metadata browsing Superset's Hive engine spec builds partition metadata queries using: ```sql SHOW PARTITIONS <schema.table> ``` Spark Thrift Server with Delta tables can reject this command with: ```text INVALID_PARTITION_OPERATION.PARTITION_MANAGEMENT_IS_UNSUPPORTED ``` The workaround returns an empty result for this metadata helper path. This does not affect Spark/Delta partition pruning for normal `SELECT` queries. ## Full Dockerfile used for local validation <details> <summary><span style="color:red;">🔴 Click to expand full Dockerfile</span></summary> ```dockerfile FROM apache/superset:latest USER root RUN apt-get update && apt-get install -y \ gcc \ g++ \ libsasl2-dev \ python3-dev \ && rm -rf /var/lib/apt/lists/* RUN /app/.venv/bin/python -m ensurepip --upgrade || true && \ /app/.venv/bin/python -m pip install --no-cache-dir \ psycopg2-binary \ trino[sqlalchemy] \ pyhive \ thrift \ thrift-sasl \ sasl \ pure-sasl # Patch 1: Superset Hive engine spec # # Superset's Hive/Spark engine spec rewrites the connection URI database # to the selected schema when browsing metadata. # # With Spark Thrift Server, this can cause Superset to mix up catalog, # schema and table names. For example, selecting schema "gold" can make # Superset generate metadata queries where "gold" is treated as both the # database/schema and the table name. # # Removing this rewrite keeps the original connection database intact and # prevents Superset from building invalid DESCRIBE / SHOW TABLES queries. RUN sed -i.bak "/if schema:/,+1d" /app/superset/db_engine_specs/hive.py # Patch 2: PyHive table reflection for Spark Thrift Server # # PyHive's default HiveDialect.get_table_names() assumes SHOW TABLES returns # a single-column result and reads row[0]. # # Spark Thrift Server returns SHOW TABLES as: # namespace | tableName | isTemporary # # In that result shape, row[0] is the namespace, for example "gold" or "silver", # and row[1] is the actual table/view name. # # Without this patch, Superset's metadata browser shows repeated schema names # instead of the actual tables/views inside bronze, silver or gold. # # This patch keeps compatibility with normal Hive-style one-column results, # while correctly handling Spark's three-column SHOW TABLES output. RUN python - <<'PY' from pathlib import Path import pyhive.sqlalchemy_hive path = Path(pyhive.sqlalchemy_hive.__file__) text = path.read_text() old = ''' def get_table_names(self, connection, schema=None, **kw): query = 'SHOW TABLES' if schema: query += ' IN ' + self.identifier_preparer.quote_identifier(schema) return [row[0] for row in connection.execute(text(query))] ''' new = ''' def get_table_names(self, connection, schema=None, **kw): query = 'SHOW TABLES' if schema: query += ' IN ' + self.identifier_preparer.quote_identifier(schema) table_names = [] for row in connection.execute(text(query)): if len(row) == 1: table_names.append(row[0]) continue if len(row) >= 3: table_name = row[1] is_temporary = row[2] if str(is_temporary).lower() in ("true", "1"): continue table_names.append(table_name) continue if len(row) >= 2: table_names.append(row[1]) return table_names ''' if old not in text: raise RuntimeError("Expected PyHive get_table_names block not found") path.write_text(text.replace(old, new)) PY # Patch 3: Make PyHive column reflection safe for Spark/Delta DESCRIBE output. # # Superset fetches table metadata through SQLAlchemy's get_columns(). # PyHive implements this by running DESCRIBE against the table and parsing the # result as if it only contains real columns. # # Spark Thrift Server with Delta tables can return extra metadata sections: # # Partition Information # # col_name # # Detailed Table Information # # These rows are not actual columns. If PyHive tries to parse them as columns, # Superset can fail when opening table metadata in the SQL Lab sidebar. # # This patch: # - skips empty rows # - skips '# col_name' header rows # - stops before partition/table metadata sections # - handles unknown or malformed types defensively RUN python - <<'PY' from pathlib import Path import pyhive.sqlalchemy_hive path = Path(pyhive.sqlalchemy_hive.__file__) text = path.read_text() old = ''' def get_columns(self, connection, table_name, schema=None, **kw): rows = self._get_table_columns(connection, table_name, schema) # Strip whitespace rows = [[col.strip() if col else None for col in row] for row in rows] # Filter out empty rows and comment rows = [row for row in rows if row[0] and row[0] != '# col_name'] result = [] for (col_name, col_type, _comment) in rows: if col_name == '# Partition Information': break # Take out the more detailed type information # e.g. 'map<int,int>' -> 'map' # 'decimal(10,1)' -> decimal col_type = re.search(r'^\\w+', col_type).group(0) try: coltype = _type_map[col_type] except KeyError: util.warn("Did not recognize type '%s' of column '%s'" % (col_type, col_name)) coltype = types.NullType result.append({ 'name': col_name, 'type': coltype, 'nullable': True, 'default': None, }) return result ''' new = ''' def get_columns(self, connection, table_name, schema=None, **kw): rows = self._get_table_columns(connection, table_name, schema) cleaned_rows = [] for row in rows: row = list(row) if len(row) < 2: continue col_name = row[0].strip() if row[0] else None col_type = row[1].strip() if row[1] else None comment = row[2].strip() if len(row) > 2 and row[2] else None if not col_name: continue if col_name == '# col_name': continue if col_name in ('# Partition Information', '# Detailed Table Information'): break if not col_type: continue cleaned_rows.append((col_name, col_type, comment)) result = [] for col_name, col_type, _comment in cleaned_rows: # Take out the more detailed type information # e.g. 'map<int,int>' -> 'map' # 'decimal(10,1)' -> 'decimal' match = re.search(r'^\\w+', col_type) if not match: util.warn("Could not parse type '%s' of column '%s'" % (col_type, col_name)) coltype = types.NullType else: col_type = match.group(0).lower() try: coltype = _type_map[col_type] except KeyError: util.warn("Did not recognize type '%s' of column '%s'" % (col_type, col_name)) coltype = types.NullType result.append({ 'name': col_name, 'type': coltype, 'nullable': True, 'default': None, }) return result ''' if old not in text: raise RuntimeError("Expected PyHive get_columns block not found") path.write_text(text.replace(old, new)) PY # Patch 4: Disable Hive-style partition lookup for Spark/Delta tables. # # Superset's Hive engine spec builds partition metadata queries using: # SHOW PARTITIONS <schema.table> # # Spark Thrift Server with Delta tables can reject this command with: # INVALID_PARTITION_OPERATION.PARTITION_MANAGEMENT_IS_UNSUPPORTED # # This does not affect Spark/Delta partition pruning for normal SELECT queries. # It only disables Superset's metadata lookup / latest-partition helper path. # # Returning an empty SELECT keeps Superset's metadata flow alive without trying # to inspect partitions through unsupported Hive commands. RUN python - <<'PY' from pathlib import Path path = Path("/app/superset/db_engine_specs/hive.py") text = path.read_text() old = ''' @classmethod def _partition_query( # pylint: disable=all cls, table: Table, indexes: list[dict[str, Any]], database: Database, limit: int = 0, order_by: list[tuple[str, bool]] | None = None, filters: dict[Any, Any] | None = None, ) -> str: full_table_name = ( f"{table.schema}.{table.table}" if table.schema else table.table ) return f"SHOW PARTITIONS {full_table_name}" ''' new = ''' @classmethod def _partition_query( # pylint: disable=all cls, table: Table, indexes: list[dict[str, Any]], database: Database, limit: int = 0, order_by: list[tuple[str, bool]] | None = None, filters: dict[Any, Any] | None = None, ) -> str: return "SELECT NULL WHERE 1 = 0" ''' if old not in text: raise RuntimeError("Expected HiveEngineSpec._partition_query block not found") path.write_text(text.replace(old, new)) PY USER superset ``` </details> ## Result after the workaround After applying these patches: - SQL Lab showed the correct schemas - selecting a schema showed the actual tables/views - table metadata opened correctly - Delta table columns were reflected correctly - the sidebar no longer generated invalid `DESCRIBE <schema>.<schema>` queries - partition metadata lookup no longer broke the metadata browsing flow ## Suggested upstream direction I am happy to help with a PR, but I would like guidance on the preferred direction. Possible options that i could think of to resolve this issue: 1. Add a dedicated Spark Thrift Server engine spec instead of treating Spark Thrift Server exactly like classic Hive. 2. Add Spark Thrift Server-aware logic in Superset's existing Hive/SparkSQL metadata reflection path. 3. Contribute the generic `SHOW TABLES` and `DESCRIBE` reflection fixes upstream to PyHive. 4. Add a Superset-side guard or capability flag to avoid Hive-only metadata commands such as `SHOW PARTITIONS` for Spark/Delta tables. My preference would be to start with a small, low-risk fix for the `SHOW TABLES` result shape, because that directly explains why schema names are shown as table names and why invalid `DESCRIBE <schema>.<schema>` queries can be generated. Then the Spark/Delta `DESCRIBE` and partition metadata behavior could be handled in follow-up PRs. Would the maintainers prefer this fix to live in Superset, in PyHive, or as a dedicated Spark Thrift Server engine spec? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
