ottensjors commented on issue #30208:
URL: https://github.com/apache/superset/issues/30208#issuecomment-4563797502

   Hi all,
   
   I ran into what appears to be the same issue when connecting Apache Superset 
to Spark Thrift Server through a `hive://` URI, using the Hive database 
connection in Superset.
   
   In my setup, normal SQL queries work correctly, but the SQL Lab left-hand 
metadata browser does not reflect schemas, tables and columns correctly for 
Spark/Delta tables.
   
   The error in this issue:
   
   ```sql
   DESCRIBE srdm_edm_sit.srdm_edm_sit
   ```
   
   matches one of the root cause patterns I found. Superset/PyHive can end up 
treating the selected schema as both the schema and the table name, which leads 
to invalid Spark SQL metadata queries.
   
   I tested a local patch in a custom Superset Docker image and the SQL Lab 
sidebar started working correctly afterwards.
   
   ## Environment
   
   - Superset: `apache/superset:latest` at the time of testing, reporting 
Superset `6.1.0`
   - Connection URI: `hive://spark-thrift-server:10000/default`
   - Backend: Spark Thrift Server
   - Spark: `3.5.3`
   - Delta Lake: `3.2.1`
   - Hive Metastore: external Hive Metastore `3.1.0`
   - Object storage: MinIO / S3-compatible storage
   - Driver path: Superset -> SQLAlchemy -> PyHive -> Spark Thrift Server -> 
Hive Metastore / MinIO
   
   ## Observed behavior
   
   When browsing schemas and tables in SQL Lab, Superset can generate incorrect 
metadata queries.
   
   For example, when selecting a schema, Superset may generate something like:
   
   ```sql
   DESCRIBE <schema>.<schema>
   ```
   
   instead of:
   
   ```sql
   DESCRIBE <schema>.<table>
   ```
   
   In my case this resulted in one or more of the following:
   
   - schema names shown as table names
   - repeated schema names in the left-hand SQL Lab sidebar
   - incorrect or missing table metadata
   - errors when opening column metadata for Delta tables
   - partition metadata errors for Delta tables
   
   Normal SQL queries against the same tables work correctly. The issue appears 
to be in metadata reflection, not in Spark query execution.
   
   ## Expected behavior
   
   When using Spark Thrift Server through `hive://`, SQL Lab should:
   
   - show the actual schemas
   - show the actual tables/views inside the selected schema
   - show the correct columns for Spark/Delta tables
   - avoid generating invalid `DESCRIBE <schema>.<schema>` metadata queries
   - avoid running Hive-specific metadata commands that Spark/Delta does not 
support
   
   ## Screenshots
   
   ### 1. Spark Thrift Server connection used in Superset
   
   This is the Superset database connection used to reproduce the issue.
   
   ```text
   hive://spark-thrift-server:10000/default
   ```
   
   <img width="496" height="851" alt="Superset Spark Thrift Server connection 
using hive URI" 
src="https://github.com/user-attachments/assets/a08a110f-ed78-4bfd-9ba9-b08a846fafae";
 />
   
   ### 2. Before patch: incorrect schema/table metadata in SQL Lab
   
   Before applying the patch, the SQL Lab sidebar shows incorrect metadata. In 
my case, schema names were shown as table names and Superset could generate 
invalid metadata queries such as:
   
   ```sql
   DESCRIBE <schema>.<schema>
   ```
   
   <img width="1227" height="723" alt="Superset SQL Lab incorrect Spark Thrift 
Server metadata before patch" 
src="https://github.com/user-attachments/assets/c666b5a9-c724-4a48-a4e9-3fbb5c78a596";
 />
   
   ### 3. After patch: correct schema/table/column metadata in SQL Lab
   
   After applying the patch, the SQL Lab sidebar correctly shows the schemas, 
tables and columns for Spark/Delta tables through Spark Thrift Server.
   
   <img width="1227" height="723" alt="Superset SQL Lab correct Spark Thrift 
Server metadata after patch" 
src="https://github.com/user-attachments/assets/1da6008e-067f-4552-a261-bb14c7147fbb";
 />
   
   ## Root causes I found
   
   I found multiple related metadata reflection issues.
   
   ### 1. Superset Hive engine spec rewrites the connection URI database during 
schema browsing
   
   Superset's Hive engine spec appears to rewrite the connection URI database 
to the selected schema when browsing metadata.
   
   With Spark Thrift Server this can cause Superset to mix up catalog, schema 
and table names.
   
   For example, selecting schema `gold` can lead to metadata queries where 
`gold` is treated as both the schema and the table name.
   
   This matches the kind of query shown in this issue:
   
   ```sql
   DESCRIBE srdm_edm_sit.srdm_edm_sit
   ```
   
   In my local test, removing this database rewrite prevented Superset from 
building invalid `DESCRIBE` / `SHOW TABLES` queries.
   
   ### 2. PyHive assumes `SHOW TABLES` returns a one-column Hive-style result
   
   PyHive's default `HiveDialect.get_table_names()` assumes `SHOW TABLES` 
returns a single-column result and reads:
   
   ```python
   row[0]
   ```
   
   However, Spark Thrift Server returns `SHOW TABLES` as:
   
   ```text
   namespace | tableName | isTemporary
   ```
   
   Example:
   
   ```text
   gold | orders    | false
   gold | customers | false
   ```
   
   In this result shape:
   
   - `row[0]` is the namespace/schema
   - `row[1]` is the actual table name
   - `row[2]` indicates whether the object is temporary
   
   Because PyHive reads `row[0]`, Superset can display repeated schema names 
instead of actual table names.
   
   A Spark-aware implementation should handle both result shapes:
   
   ```python
   if len(row) == 1:
       table_name = row[0]
   elif len(row) >= 3:
       table_name = row[1]
       is_temporary = row[2]
   ```
   
   Temporary objects can optionally be skipped when `isTemporary` is true.
   
   ### 3. PyHive column reflection is not defensive enough for Spark/Delta 
`DESCRIBE` output
   
   Superset fetches table metadata through SQLAlchemy's `get_columns()` flow.
   
   PyHive implements this by running `DESCRIBE` and parsing the result as if it 
only contains real table columns.
   
   Spark/Delta can return extra metadata sections in `DESCRIBE` output, for 
example:
   
   ```text
   # Partition Information
   # col_name
   # Detailed Table Information
   ```
   
   These rows are not real table columns.
   
   If PyHive tries to parse them as normal columns, the SQL Lab sidebar can 
fail when opening table metadata.
   
   The parser should skip or stop at metadata/header rows such as:
   
   ```text
   # col_name
   # Partition Information
   # Detailed Table Information
   ```
   
   It should also handle malformed or unknown types defensively instead of 
failing hard.
   
   ### 4. Superset's Hive partition lookup can fail for Spark/Delta tables
   
   Superset's Hive engine spec can generate Hive-style partition metadata 
queries such as:
   
   ```sql
   SHOW PARTITIONS <schema>.<table>
   ```
   
   Spark/Delta can reject this with an error similar to:
   
   ```text
   INVALID_PARTITION_OPERATION.PARTITION_MANAGEMENT_IS_UNSUPPORTED
   ```
   
   This affects Superset metadata browsing only.
   
   It does not affect normal `SELECT` queries or Spark/Delta partition pruning.
   
   In my local workaround I disabled this metadata lookup path for Spark/Delta 
by returning an empty result instead of calling `SHOW PARTITIONS`.
   
   A cleaner upstream fix may be to guard this behavior behind a 
Spark/Delta-specific capability flag instead of changing generic Hive behavior 
globally.
   
   ## Local workaround that fixed the issue
   
   I tested four local patches in a custom Superset Docker image:
   
   1. Removed the Hive engine spec schema/database URI rewrite during metadata 
browsing.
   2. Fixed PyHive `SHOW TABLES` parsing for Spark's `namespace | tableName | 
isTemporary` result shape.
   3. Made PyHive `DESCRIBE` parsing skip Spark/Delta metadata rows.
   4. Disabled Hive-style `SHOW PARTITIONS` metadata lookup for Spark/Delta 
tables.
   
   After these patches, SQL Lab reflected schemas, tables and columns correctly.
   
   This is not necessarily the exact upstream implementation I am proposing, 
but it confirms the root causes.
   
   ### Patch summary
   
   #### Patch 1: Prevent Hive engine spec from rewriting the connection URI 
database
   
   Superset's Hive/Spark engine spec rewrites the connection URI database to 
the selected schema when browsing metadata.
   
   With Spark Thrift Server, this can cause Superset to mix up catalog, schema 
and table names. Removing this rewrite keeps the original connection database 
intact and prevents Superset from building invalid `DESCRIBE` / `SHOW TABLES` 
queries.
   
   #### Patch 2: Handle Spark Thrift Server `SHOW TABLES` result shape
   
   PyHive's default `HiveDialect.get_table_names()` assumes `SHOW TABLES` 
returns a single-column result and reads `row[0]`.
   
   Spark Thrift Server returns:
   
   ```text
   namespace | tableName | isTemporary
   ```
   
   The workaround keeps compatibility with normal Hive-style one-column 
results, while correctly handling Spark's three-column `SHOW TABLES` output.
   
   #### Patch 3: Make PyHive column reflection safe for Spark/Delta `DESCRIBE` 
output
   
   Spark Thrift Server with Delta tables can return extra metadata sections in 
`DESCRIBE` output:
   
   ```text
   # Partition Information
   # col_name
   # Detailed Table Information
   ```
   
   The workaround skips empty rows, skips `# col_name` header rows, stops 
before partition/table metadata sections, and handles unknown or malformed 
types defensively.
   
   #### Patch 4: Disable Hive-style partition lookup for Spark/Delta metadata 
browsing
   
   Superset's Hive engine spec builds partition metadata queries using:
   
   ```sql
   SHOW PARTITIONS <schema.table>
   ```
   
   Spark Thrift Server with Delta tables can reject this command with:
   
   ```text
   INVALID_PARTITION_OPERATION.PARTITION_MANAGEMENT_IS_UNSUPPORTED
   ```
   
   The workaround returns an empty result for this metadata helper path. This 
does not affect Spark/Delta partition pruning for normal `SELECT` queries.
   
   ## Full Dockerfile used for local validation
   
   <details>
   <summary><span style="color:red;">🔴 Click to expand full 
Dockerfile</span></summary>
   
   ```dockerfile
   FROM apache/superset:latest
   
   USER root
   
   RUN apt-get update && apt-get install -y \
       gcc \
       g++ \
       libsasl2-dev \
       python3-dev \
       && rm -rf /var/lib/apt/lists/*
   
   RUN /app/.venv/bin/python -m ensurepip --upgrade || true && \
       /app/.venv/bin/python -m pip install --no-cache-dir \
         psycopg2-binary \
         trino[sqlalchemy] \
         pyhive \
         thrift \
         thrift-sasl \
         sasl \
         pure-sasl
   
   # Patch 1: Superset Hive engine spec
   #
   # Superset's Hive/Spark engine spec rewrites the connection URI database
   # to the selected schema when browsing metadata.
   #
   # With Spark Thrift Server, this can cause Superset to mix up catalog,
   # schema and table names. For example, selecting schema "gold" can make
   # Superset generate metadata queries where "gold" is treated as both the
   # database/schema and the table name.
   #
   # Removing this rewrite keeps the original connection database intact and
   # prevents Superset from building invalid DESCRIBE / SHOW TABLES queries.
   RUN sed -i.bak "/if schema:/,+1d" /app/superset/db_engine_specs/hive.py
   
   
   # Patch 2: PyHive table reflection for Spark Thrift Server
   #
   # PyHive's default HiveDialect.get_table_names() assumes SHOW TABLES returns
   # a single-column result and reads row[0].
   #
   # Spark Thrift Server returns SHOW TABLES as:
   #   namespace | tableName | isTemporary
   #
   # In that result shape, row[0] is the namespace, for example "gold" or 
"silver",
   # and row[1] is the actual table/view name.
   #
   # Without this patch, Superset's metadata browser shows repeated schema names
   # instead of the actual tables/views inside bronze, silver or gold.
   #
   # This patch keeps compatibility with normal Hive-style one-column results,
   # while correctly handling Spark's three-column SHOW TABLES output.
   RUN python - <<'PY'
   from pathlib import Path
   import pyhive.sqlalchemy_hive
   
   path = Path(pyhive.sqlalchemy_hive.__file__)
   text = path.read_text()
   
   old = '''    def get_table_names(self, connection, schema=None, **kw):
           query = 'SHOW TABLES'
           if schema:
               query += ' IN ' + 
self.identifier_preparer.quote_identifier(schema)
           return [row[0] for row in connection.execute(text(query))]
   '''
   
   new = '''    def get_table_names(self, connection, schema=None, **kw):
           query = 'SHOW TABLES'
           if schema:
               query += ' IN ' + 
self.identifier_preparer.quote_identifier(schema)
   
           table_names = []
   
           for row in connection.execute(text(query)):
               if len(row) == 1:
                   table_names.append(row[0])
                   continue
   
               if len(row) >= 3:
                   table_name = row[1]
                   is_temporary = row[2]
   
                   if str(is_temporary).lower() in ("true", "1"):
                       continue
   
                   table_names.append(table_name)
                   continue
   
               if len(row) >= 2:
                   table_names.append(row[1])
   
           return table_names
   '''
   
   if old not in text:
       raise RuntimeError("Expected PyHive get_table_names block not found")
   
   path.write_text(text.replace(old, new))
   PY
   
   # Patch 3: Make PyHive column reflection safe for Spark/Delta DESCRIBE 
output.
   #
   # Superset fetches table metadata through SQLAlchemy's get_columns().
   # PyHive implements this by running DESCRIBE against the table and parsing 
the
   # result as if it only contains real columns.
   #
   # Spark Thrift Server with Delta tables can return extra metadata sections:
   #   # Partition Information
   #   # col_name
   #   # Detailed Table Information
   #
   # These rows are not actual columns. If PyHive tries to parse them as 
columns,
   # Superset can fail when opening table metadata in the SQL Lab sidebar.
   #
   # This patch:
   # - skips empty rows
   # - skips '# col_name' header rows
   # - stops before partition/table metadata sections
   # - handles unknown or malformed types defensively
   RUN python - <<'PY'
   from pathlib import Path
   import pyhive.sqlalchemy_hive
   
   path = Path(pyhive.sqlalchemy_hive.__file__)
   text = path.read_text()
   
   old = '''    def get_columns(self, connection, table_name, schema=None, 
**kw):
           rows = self._get_table_columns(connection, table_name, schema)
           # Strip whitespace
           rows = [[col.strip() if col else None for col in row] for row in 
rows]
           # Filter out empty rows and comment
           rows = [row for row in rows if row[0] and row[0] != '# col_name']
           result = []
           for (col_name, col_type, _comment) in rows:
               if col_name == '# Partition Information':
                   break
               # Take out the more detailed type information
               # e.g. 'map<int,int>' -> 'map'
               #      'decimal(10,1)' -> decimal
               col_type = re.search(r'^\\w+', col_type).group(0)
               try:
                   coltype = _type_map[col_type]
               except KeyError:
                   util.warn("Did not recognize type '%s' of column '%s'" % 
(col_type, col_name))
                   coltype = types.NullType
   
               result.append({
                   'name': col_name,
                   'type': coltype,
                   'nullable': True,
                   'default': None,
               })
           return result
   '''
   
   new = '''    def get_columns(self, connection, table_name, schema=None, 
**kw):
           rows = self._get_table_columns(connection, table_name, schema)
   
           cleaned_rows = []
   
           for row in rows:
               row = list(row)
   
               if len(row) < 2:
                   continue
   
               col_name = row[0].strip() if row[0] else None
               col_type = row[1].strip() if row[1] else None
               comment = row[2].strip() if len(row) > 2 and row[2] else None
   
               if not col_name:
                   continue
   
               if col_name == '# col_name':
                   continue
   
               if col_name in ('# Partition Information', '# Detailed Table 
Information'):
                   break
   
               if not col_type:
                   continue
   
               cleaned_rows.append((col_name, col_type, comment))
   
           result = []
   
           for col_name, col_type, _comment in cleaned_rows:
               # Take out the more detailed type information
               # e.g. 'map<int,int>' -> 'map'
               #      'decimal(10,1)' -> 'decimal'
               match = re.search(r'^\\w+', col_type)
   
               if not match:
                   util.warn("Could not parse type '%s' of column '%s'" % 
(col_type, col_name))
                   coltype = types.NullType
               else:
                   col_type = match.group(0).lower()
   
                   try:
                       coltype = _type_map[col_type]
                   except KeyError:
                       util.warn("Did not recognize type '%s' of column '%s'" % 
(col_type, col_name))
                       coltype = types.NullType
   
               result.append({
                   'name': col_name,
                   'type': coltype,
                   'nullable': True,
                   'default': None,
               })
   
           return result
   '''
   
   if old not in text:
       raise RuntimeError("Expected PyHive get_columns block not found")
   
   path.write_text(text.replace(old, new))
   PY
   
   # Patch 4: Disable Hive-style partition lookup for Spark/Delta tables.
   #
   # Superset's Hive engine spec builds partition metadata queries using:
   #   SHOW PARTITIONS <schema.table>
   #
   # Spark Thrift Server with Delta tables can reject this command with:
   #   INVALID_PARTITION_OPERATION.PARTITION_MANAGEMENT_IS_UNSUPPORTED
   #
   # This does not affect Spark/Delta partition pruning for normal SELECT 
queries.
   # It only disables Superset's metadata lookup / latest-partition helper path.
   #
   # Returning an empty SELECT keeps Superset's metadata flow alive without 
trying
   # to inspect partitions through unsupported Hive commands.
   RUN python - <<'PY'
   from pathlib import Path
   
   path = Path("/app/superset/db_engine_specs/hive.py")
   text = path.read_text()
   
   old = '''    @classmethod
       def _partition_query(  # pylint: disable=all
           cls,
           table: Table,
           indexes: list[dict[str, Any]],
           database: Database,
           limit: int = 0,
           order_by: list[tuple[str, bool]] | None = None,
           filters: dict[Any, Any] | None = None,
       ) -> str:
           full_table_name = (
               f"{table.schema}.{table.table}" if table.schema else table.table
           )
           return f"SHOW PARTITIONS {full_table_name}"
   '''
   
   new = '''    @classmethod
       def _partition_query(  # pylint: disable=all
           cls,
           table: Table,
           indexes: list[dict[str, Any]],
           database: Database,
           limit: int = 0,
           order_by: list[tuple[str, bool]] | None = None,
           filters: dict[Any, Any] | None = None,
       ) -> str:
           return "SELECT NULL WHERE 1 = 0"
   '''
   
   if old not in text:
       raise RuntimeError("Expected HiveEngineSpec._partition_query block not 
found")
   
   path.write_text(text.replace(old, new))
   PY
   
   USER superset
   ```
   
   </details>
   
   ## Result after the workaround
   
   After applying these patches:
   
   - SQL Lab showed the correct schemas
   - selecting a schema showed the actual tables/views
   - table metadata opened correctly
   - Delta table columns were reflected correctly
   - the sidebar no longer generated invalid `DESCRIBE <schema>.<schema>` 
queries
   - partition metadata lookup no longer broke the metadata browsing flow
   
   ## Suggested upstream direction
   
   I am happy to help with a PR, but I would like guidance on the preferred 
direction.
   
   Possible options that i could think of to resolve this issue:
   
   1. Add a dedicated Spark Thrift Server engine spec instead of treating Spark 
Thrift Server exactly like classic Hive.
   2. Add Spark Thrift Server-aware logic in Superset's existing Hive/SparkSQL 
metadata reflection path.
   3. Contribute the generic `SHOW TABLES` and `DESCRIBE` reflection fixes 
upstream to PyHive.
   4. Add a Superset-side guard or capability flag to avoid Hive-only metadata 
commands such as `SHOW PARTITIONS` for Spark/Delta tables.
   
   My preference would be to start with a small, low-risk fix for the `SHOW 
TABLES` result shape, because that directly explains why schema names are shown 
as table names and why invalid `DESCRIBE <schema>.<schema>` queries can be 
generated.
   
   Then the Spark/Delta `DESCRIBE` and partition metadata behavior could be 
handled in follow-up PRs.
   
   Would the maintainers prefer this fix to live in Superset, in PyHive, or as 
a dedicated Spark Thrift Server engine spec?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to