joshrosen-stripe commented on a change in pull request #26993:
[WIP][SPARK-30338][SQL] Avoid unnecessary InternalRow copies in
ParquetRowConverter
URL: https://github.com/apache/spark/pull/26993#discussion_r361050050
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
##########
@@ -318,10 +318,31 @@ private[parquet] class ParquetRowConverter(
new ParquetMapConverter(parquetType.asGroupType(), t, updater)
case t: StructType =>
+ val wrappedUpdater = {
+ if (updater.isInstanceOf[RowUpdater]) {
+ // `updater` is a RowUpdater, implying that the parent container
is a struct.
+ // We do NOT need to perform defensive copying here because either:
+ //
+ // 1. The path from the schema root to this field consists only
of nested
+ // structs, so this converter will only be invoked once per
record and
+ // we don't need to copy because copying will be done in the
final
+ // UnsafeProjection, or
+ // 2. The path from the schema root to this field contains a map
or array,
+ // in which case we will perform a recursive defensive copy
via the
Review comment:
Correctness relies on the copy actually being a deep copy. Looking elsewhere
in this file, we have comments like
```
// NOTE: We can't reuse the mutable Map here and must instantiate a new
`Map` for the next
// value. `Row.copy()` only copies row cells, it doesn't do deep copy
to objects stored in row
// cells.
```
which suggest that certain copying might be shallow, so it's important to
double-check and make sure that the copies are indeed deep.
Here, the state being copied is an `InternalRow`. To be more specific, it's
actually a `SpecificInternalRow` (I'll update the `.asInstanceOf` cast below to
reflect this). `SpecificInternalRow` extends `BaseGenericInternalRow` and
#18483 changed that to implement a deep-copy, recursively copying maps, arrays,
and structs.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]