Xin Huang created SPARK-55884:
---------------------------------
Summary: Add utility to convert CatalogStatistics to V2 Statistics
for DSv2 connectors
Key: SPARK-55884
URL: https://issues.apache.org/jira/browse/SPARK-55884
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.2.0
Reporter: Xin Huang
DSv2 connectors (such as Delta Kernel) often need statistics stored in the V1
catalog (e.g., Hive Metastore) to optimize query planning. However, Spark has
no shared utility to convert these V1
`CatalogStatistics` and `CatalogColumnStat` objects into the V2
`org.apache.spark.sql.connector.read.Statistics` and `ColumnStatistics`
interfaces expected by DSv2 APIs.
As a result, each connector has to reimplement this conversion logic itself.
This ticket proposes adding a new utility method,
`DataSourceV2Relation.transformV1Stats`, to perform this conversion. This would
mirror the existing `DataSourceV2Relation.transformV2Stats` (which handles the
reverse V2 -> V1 direction) and ensure consistent handling of:
* Table-level stats (sizeInBytes, rowCount)
* Column-level stats (min, max, nullCount, distinctCount, avgLen, maxLen)
* Histograms (converting internal Histogram types to V2 interfaces)
Placing this logic in `DataSourceV2Relation` ensures that V1 catalog classes
remain decoupled from V2 connector interfaces.
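
To illustrate the shape of the proposed conversion, here is a minimal,
self-contained sketch. It does not use Spark's classes: `V1ColumnStat`,
`V1Stats`, `V2ColumnStats` and `V2Stats` are hypothetical stand-ins for
`CatalogColumnStat`, `CatalogStatistics`, `ColumnStatistics` and `Statistics`,
and the actual signature of `transformV1Stats` is not defined by this ticket.
The sketch assumes that values too large for a long are clamped to
`Long.MAX_VALUE`, keys column stats by plain column name, passes min/max
through as catalog strings, and leaves histograms out entirely.

```java
import java.math.BigInteger;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.OptionalLong;

class V1StatsSketch {
    // Hypothetical stand-in for the V1 CatalogColumnStat (min/max kept as strings here).
    record V1ColumnStat(Optional<BigInteger> distinctCount, Optional<String> min,
                        Optional<String> max, Optional<BigInteger> nullCount,
                        Optional<Long> avgLen, Optional<Long> maxLen) {}

    // Hypothetical stand-in for the V1 CatalogStatistics.
    record V1Stats(BigInteger sizeInBytes, Optional<BigInteger> rowCount,
                   Map<String, V1ColumnStat> colStats) {}

    // Hypothetical stand-in for the V2 ColumnStatistics interface.
    record V2ColumnStats(OptionalLong distinctCount, Optional<Object> min,
                         Optional<Object> max, OptionalLong nullCount,
                         OptionalLong avgLen, OptionalLong maxLen) {}

    // Hypothetical stand-in for the V2 Statistics interface.
    record V2Stats(OptionalLong sizeInBytes, OptionalLong numRows,
                   Map<String, V2ColumnStats> columnStats) {}

    // Assumption: arbitrary-precision values above Long.MAX_VALUE are clamped, not rejected.
    static long clamp(BigInteger v) {
        return v.min(BigInteger.valueOf(Long.MAX_VALUE)).longValue();
    }

    static OptionalLong fromBig(Optional<BigInteger> v) {
        return v.map(b -> OptionalLong.of(clamp(b))).orElseGet(OptionalLong::empty);
    }

    static OptionalLong fromLong(Optional<Long> v) {
        return v.map(l -> OptionalLong.of(l)).orElseGet(OptionalLong::empty);
    }

    // Converts table-level and column-level stats; absent V1 values stay absent in V2.
    static V2Stats transformV1Stats(V1Stats v1) {
        Map<String, V2ColumnStats> cols = new LinkedHashMap<>();
        v1.colStats().forEach((name, c) -> cols.put(name, new V2ColumnStats(
            fromBig(c.distinctCount()),
            // A real implementation would likely parse min/max by column data type.
            c.min().<Object>map(s -> s),
            c.max().<Object>map(s -> s),
            fromBig(c.nullCount()),
            fromLong(c.avgLen()),
            fromLong(c.maxLen()))));
        return new V2Stats(
            OptionalLong.of(clamp(v1.sizeInBytes())), fromBig(v1.rowCount()), cols);
    }

    public static void main(String[] args) {
        V1ColumnStat id = new V1ColumnStat(
            Optional.of(BigInteger.valueOf(10)), Optional.of("1"), Optional.of("10"),
            Optional.of(BigInteger.ZERO), Optional.of(4L), Optional.of(4L));
        V1Stats v1 = new V1Stats(
            BigInteger.valueOf(1024), Optional.of(BigInteger.valueOf(10)), Map.of("id", id));
        System.out.println(transformV1Stats(v1));
    }
}
```

Keeping the conversion in one function that takes the V1 object and returns
the V2 view matches the decoupling goal above: the V1 stand-ins know nothing
about the V2 types.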