Github user liancheng commented on a diff in the pull request:
https://github.com/apache/spark/pull/8553#discussion_r38504498
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystReadSupport.scala ---
@@ -32,31 +32,56 @@ import org.apache.spark.Logging
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types._
+/**
+ * A Parquet [[ReadSupport]] implementation for reading Parquet records as Catalyst
+ * [[InternalRow]]s.
+ *
+ * The API interface of [[ReadSupport]] is a little bit over complicated because of historical
+ * reasons. In older versions of parquet-mr (say 1.6.0rc3 and prior), [[ReadSupport]] need to be
+ * instantiated and initialized twice on both driver side and executor side. The [[init()]] method
+ * is for driver side initialization, while [[prepareForRead()]] is for executor side. However,
+ * starting from parquet-mr 1.6.0, it's no longer the case, and [[ReadSupport]] is only instantiated
+ * and initialized on executor side. So, theoretically, now it's totally fine to combine these two
+ * methods into a single initialization method. The only reason (I could think of) to still have
+ * them here is for parquet-mr API backwards-compatibility.
+ *
+ * Due to this reason, we no longer rely on [[ReadContext]] to pass requested schema from [[init()]]
+ * to [[prepareForRead()]], but use a private `var` for simplicity.
+ */
private[parquet] class CatalystReadSupport extends ReadSupport[InternalRow] with Logging {
- // Called after `init()` when initializing Parquet record reader.
+ private var catalystRequestedSchema: StructType = _
+
+ /**
+ * Called on executor side before [[prepareForRead()]] and instantiating actual Parquet record
+ * readers. Responsible for figuring out Parquet requested schema used for column pruning.
+ */
+ override def init(context: InitContext): ReadContext = {
--- End diff --
Moved this method in front of `prepareForRead()` for better readability,
since this method is called right before `prepareForRead()`.
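
For illustration, the private `var` hand-off described in the Scaladoc above could be sketched roughly as follows. This is only a minimal, self-contained sketch: `SketchSchema`, `SketchInitContext`, `SketchReadContext`, and `SketchReadSupport` are hypothetical stand-ins, not the parquet-mr or Spark classes, and the column pruning is reduced to a simple name filter.

```scala
// Hypothetical stand-ins for the real schema and context types (assumptions, not parquet-mr API).
final case class SketchSchema(fields: Seq[String])
final case class SketchInitContext(fileSchema: SketchSchema, requested: Option[Seq[String]])
final case class SketchReadContext(requestedSchema: SketchSchema)

class SketchReadSupport {
  // Written by init() and read by prepareForRead(). This is only safe because
  // both methods are called on the same instance, on the same side.
  private var requestedSchema: SketchSchema = _

  // Called first: works out which columns to read (column pruning).
  def init(context: SketchInitContext): SketchReadContext = {
    val wanted = context.requested.getOrElse(context.fileSchema.fields)
    // Keep the file's column order and drop columns that were not requested.
    requestedSchema = SketchSchema(context.fileSchema.fields.filter(wanted.contains))
    SketchReadContext(requestedSchema)
  }

  // Called second: uses the field set by init() instead of unpacking readContext.
  def prepareForRead(readContext: SketchReadContext): Seq[String] = {
    require(requestedSchema != null, "init() must be called before prepareForRead()")
    requestedSchema.fields
  }
}

object SketchDemo {
  // Runs both phases on one instance and returns the pruned column names.
  def run(): Seq[String] = {
    val support = new SketchReadSupport
    val ctx = support.init(
      SketchInitContext(SketchSchema(Seq("a", "b", "c")), Some(Seq("c", "a"))))
    support.prepareForRead(ctx)
  }
}
```

With `init()` placed first in the source, the order in which the field is written and then read matches the order in which the two methods are called.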