Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/22009#discussion_r208347697
--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsPushDownRequiredColumns.java ---
@@ -21,22 +21,25 @@
import org.apache.spark.sql.types.StructType;
/**
- * A mix-in interface for {@link DataSourceReader}. Data source readers can implement this
+ * A mix-in interface for {@link ScanConfigBuilder}. Data sources can implement this
* interface to push down required columns to the data source and only read these columns during
* scan to reduce the size of the data to be read.
*/
@InterfaceStability.Evolving
-public interface SupportsPushDownRequiredColumns extends DataSourceReader {
+public interface SupportsPushDownRequiredColumns extends ScanConfigBuilder {
/**
* Applies column pruning w.r.t. the given requiredSchema.
*
* Implementation should try its best to prune the unnecessary columns or nested fields, but it's
* also OK to do the pruning partially, e.g., a data source may not be able to prune nested
* fields, and only prune top-level columns.
- *
- * Note that, data source readers should update {@link DataSourceReader#readSchema()} after
- * applying column pruning.
*/
void pruneColumns(StructType requiredSchema);
+
+ /**
+ * Returns the schema after the column pruning is applied, so that Spark can know if some
+ * columns/nested fields are not pruned.
+ */
+ StructType prunedSchema();
--- End diff --
I don't see a reason to add this. Why not get the final schema from the
`ScanConfig`? Getting the schema from the `ScanConfig` is better because it is
clear when the pruned schema will be accessed: after all pushdown methods are
called.

That matters because pushed filters may require columns beyond the ones Spark
asked for, and the source may choose to return those extra columns to Spark
instead of adding its own projection to drop them. Deferring the projection to
Spark is more efficient if Spark was going to add one anyway.
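
To illustrate the ordering concern, here is a minimal sketch using simplified stand-in types, not the real Spark interfaces. Schemas are modeled as lists of column names rather than `StructType`, and `pushFilter`, `build`, `readSchema`, and the class name `PushdownOrderSketch` are assumptions made for this example. It shows why the final schema is only reliable once every pushdown call has been made:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-ins for illustration only; these are NOT the Spark interfaces.
public class PushdownOrderSketch {

  /** Hypothetical immutable result produced after all pushdown calls. */
  static final class ScanConfig {
    private final List<String> readSchema;

    ScanConfig(List<String> readSchema) {
      this.readSchema = Collections.unmodifiableList(new ArrayList<>(readSchema));
    }

    /** The schema the source will actually return to Spark. */
    List<String> readSchema() {
      return readSchema;
    }
  }

  /** Hypothetical builder that accepts column pruning and filter pushdown. */
  static final class ScanConfigBuilder {
    private List<String> required = new ArrayList<>();
    private final Set<String> filterColumns = new LinkedHashSet<>();

    void pruneColumns(List<String> requiredSchema) {
      required = new ArrayList<>(requiredSchema);
    }

    // Assumption: the source evaluates this filter itself, so it must read the column.
    void pushFilter(String column) {
      filterColumns.add(column);
    }

    ScanConfig build() {
      // Requested columns first, then any extra columns the pushed filters need.
      // The source returns the extras and leaves the final projection to Spark.
      Set<String> all = new LinkedHashSet<>(required);
      all.addAll(filterColumns);
      return new ScanConfig(new ArrayList<>(all));
    }
  }

  /** Runs pruning, then filter pushdown, and reads the schema only at the end. */
  static List<String> finalSchema() {
    ScanConfigBuilder builder = new ScanConfigBuilder();
    builder.pruneColumns(Arrays.asList("id", "name"));
    builder.pushFilter("ts"); // a schema read before this call would miss "ts"
    return builder.build().readSchema();
  }

  public static void main(String[] args) {
    System.out.println(finalSchema());
  }
}
```

In this sketch, a `prunedSchema()` read directly after `pruneColumns` would report only `id` and `name`, while the schema taken from the built config also includes `ts`.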
---