Github user jose-torres commented on a diff in the pull request:
https://github.com/apache/spark/pull/22009#discussion_r208987767
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartition.java
---
@@ -22,18 +22,16 @@
import org.apache.spark.annotation.InterfaceStability;
/**
- * An input partition returned by {@link DataSourceReader#planInputPartitions()} and is
- * responsible for creating the actual data reader of one RDD partition.
- * The relationship between {@link InputPartition} and {@link InputPartitionReader}
- * is similar to the relationship between {@link Iterable} and {@link java.util.Iterator}.
+ * An input partition returned by {@link ReadSupport#planInputPartitions(ScanConfig)}, which
+ * represents a data split that should be processed by one Spark task.
--- End diff ---
I'm not sure we need to talk about "data split" - I don't think people will
try to implement data sources without knowing what a partition is in Spark.
I'd suggest saying "A serializable representation of an input
partition...", to make it clear that this should just contain metadata required
to identify what the partition is and not the actual data.
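To illustrate the suggested reading of "a serializable representation of an input partition", here is a hypothetical sketch (not from the Spark codebase; the class name and fields are invented): a partition object carries only the lightweight metadata a task needs to locate its data, such as a file path and byte range, never the rows themselves.

```java
import java.io.Serializable;

// Hypothetical example: an input partition as pure metadata.
// It identifies *where* the data lives (path + byte range); the actual
// reading happens later, on the executor, using this metadata.
class FileSlicePartition implements Serializable {
    private final String path;   // location of the underlying file
    private final long start;    // starting byte offset of this slice
    private final long length;   // number of bytes in this slice

    FileSlicePartition(String path, long start, long length) {
        this.path = path;
        this.start = start;
        this.length = length;
    }

    String path() { return path; }
    long start() { return start; }
    long length() { return length; }

    public static void main(String[] args) {
        // The object is cheap to serialize and ship to a task: it holds
        // only identifying metadata, no row data.
        FileSlicePartition p = new FileSlicePartition("/data/part-00000", 0L, 1024L);
        System.out.println(p.path() + " [" + p.start() + ", " + p.length() + ")");
    }
}
```

Because such an object is serialized from the driver to executors, keeping it to identifying metadata (rather than buffered data) keeps task scheduling cheap.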
---