Re: [PR] [SQL] Add toV2Stats/toV2ColStat to CatalogStatistics for DSv2 connectors [spark]

via GitHub Sat, 07 Mar 2026 10:08:41 -0800


huan233usc commented on code in PR #54667:
URL: https://github.com/apache/spark/pull/54667#discussion_r2900166209



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala:
##########
@@ -862,6 +864,34 @@ case class CatalogStatistics(
     }
   }
 
+  /**
+   * Convert [[CatalogStatistics]] to v2 connector [[V2Statistics]], matching column stats to
+   * schema columns by name. All available statistics are returned unconditionally; callers are
+   * responsible for gating on CBO / planStats settings before using numRows or columnStats.
+   *
+   * @param schema combined data + partition schema, used to resolve the DataType for each column
+   *               when deserializing min/max string values
+   */
+  def toV2Stats(schema: StructType): V2Statistics = {
+    val typeMap = schema.fields.map(f => f.name -> f.dataType).toMap
+    val colStatsMap: Map[NamedReference, ColumnStatistics] = colStats.flatMap { case (name, stat) =>
+      typeMap.get(name).map { dt =>
+        FieldReference.apply(name) -> stat.toV2ColStat(name, dt)

Review Comment:
   Done: `toV2Stats` now calls `FieldReference.column(name)` instead of
   `FieldReference.apply(name)`. This treats the column name as a single
   identifier and avoids the `CatalystSqlParser` path, which would split a
   name like `a.b` into a multi-part reference.
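
The difference can be illustrated with a small self-contained model. The `Ref` type and both constructors below are hypothetical stand-ins, not Spark's `FieldReference`. In particular, `parsed` splits on dots naively, whereas a real SQL identifier parser also handles backtick quoting. It is only a sketch of why a top-level column literally named `a.b` needs the single-part form to line up with its statistics entry.

```scala
// Hypothetical stand-in for a named reference: a sequence of name parts.
final case class Ref(parts: Seq[String])

object Ref {
  // Models a parsing constructor: dots are treated as nested-field separators.
  // Simplification: a real parser would also understand backtick quoting.
  def parsed(name: String): Ref = Ref(name.split('.').toSeq)

  // Models a single-part constructor: the whole string is one identifier.
  def column(name: String): Ref = Ref(Seq(name))
}

object RefDemo {
  def main(args: Array[String]): Unit = {
    val name = "a.b" // a top-level column whose name happens to contain a dot
    println(s"parsed: ${Ref.parsed(name).parts.length} part(s)")
    println(s"column: ${Ref.column(name).parts.length} part(s)")
  }
}
```

In this model the parsed form yields two parts and so would be read as field `b` nested inside `a`, while the single-part form keeps the name intact.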



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala:
##########
@@ -936,6 +966,29 @@ case class CatalogColumnStat(
       maxLen = maxLen,
       histogram = histogram,
       version = version)
+
+  /**
+   * Convert [[CatalogColumnStat]] to a v2 connector [[ColumnStatistics]].
+   * min/max are deserialized from their external string representation using the column's DataType.
+   */
+  def toV2ColStat(colName: String, dataType: DataType): ColumnStatistics = {
+    val parsedMin = min.map(CatalogColumnStat.fromExternalString(_, colName, dataType, version))
+    val parsedMax = max.map(CatalogColumnStat.fromExternalString(_, colName, dataType, version))
+    val v2DistinctCount =
+      distinctCount.map(v => OptionalLong.of(v.longValue)).getOrElse(OptionalLong.empty())
+    val v2NullCount =
+      nullCount.map(v => OptionalLong.of(v.longValue)).getOrElse(OptionalLong.empty())
+    val v2AvgLen = avgLen.map(OptionalLong.of).getOrElse(OptionalLong.empty())
+    val v2MaxLen = maxLen.map(OptionalLong.of).getOrElse(OptionalLong.empty())
+    new ColumnStatistics {
+      override def distinctCount(): OptionalLong = v2DistinctCount
+      override def min(): Optional[Object] = Optional.ofNullable(parsedMin.orNull)
+      override def max(): Optional[Object] = Optional.ofNullable(parsedMax.orNull)
+      override def nullCount(): OptionalLong = v2NullCount
+      override def avgLen(): OptionalLong = v2AvgLen
+      override def maxLen(): OptionalLong = v2MaxLen

Review Comment:
   Fixed: `toV2ColStat` now converts `histogram` to the V2
   `Histogram`/`HistogramBin` interfaces, mirroring what
   `DataSourceV2Relation.transformV2Stats` does in the reverse direction. The
   V2 types are imported under aliases (`V2Histogram`, `V2HistogramBin`) so
   they do not shadow the internal catalyst case classes already in scope.
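
A rough, self-contained sketch of that conversion and of the import aliasing follows. The `catalyst` and `v2` objects are hypothetical stand-ins whose member shapes (`height`, `bins`, `lo`, `hi`, `ndv`) are assumed for illustration; they are not the Spark classes, and `toV2Histogram` is an illustrative helper name rather than code from this PR.

```scala
// Stand-in for the catalyst-side case classes (shapes assumed).
object catalyst {
  final case class HistogramBin(lo: Double, hi: Double, ndv: Long)
  final case class Histogram(height: Double, bins: Array[HistogramBin])
}

// Stand-in for the V2 connector interfaces (shapes assumed).
object v2 {
  trait HistogramBin { def lo(): Double; def hi(): Double; def ndv(): Long }
  trait Histogram { def height(): Double; def bins(): Array[HistogramBin] }
}

object HistogramConversion {
  import catalyst._
  // Aliases keep the V2 names from clashing with the catalyst names above.
  import v2.{Histogram => V2Histogram, HistogramBin => V2HistogramBin}

  // Wrap one catalyst bin in the V2 interface.
  private def toV2Bin(b: HistogramBin): V2HistogramBin = new V2HistogramBin {
    override def lo(): Double = b.lo
    override def hi(): Double = b.hi
    override def ndv(): Long = b.ndv
  }

  // Convert the whole histogram; bins are converted once, up front.
  def toV2Histogram(h: Histogram): V2Histogram = {
    val v2Bins: Array[V2HistogramBin] = h.bins.map(toV2Bin)
    new V2Histogram {
      override def height(): Double = h.height
      override def bins(): Array[V2HistogramBin] = v2Bins
    }
  }
}
```

Without the aliases, `Histogram` inside `HistogramConversion` would be ambiguous between the two imports, which is the clash the comment describes.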



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

