joshrosen-stripe commented on a change in pull request #27089: [SPARK-30414][SQL] ParquetRowConverter optimizations: arrays, maps, plus misc. constant factors
URL: https://github.com/apache/spark/pull/27089#discussion_r363567790
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
 ##########
 @@ -517,20 +521,18 @@ private[parquet] class ParquetRowConverter(

     override def end(): Unit = updater.set(new GenericArrayData(currentArray.toArray))

-    // NOTE: We can't reuse the mutable `ArrayBuffer` here and must instantiate a new buffer for the
-    // next value.  `Row.copy()` only copies row cells, it doesn't do deep copy to objects stored
-    // in row cells.
-    override def start(): Unit = currentArray = ArrayBuffer.empty[Any]
 Review comment:
   `ArrayBuffer.toArray` should always return a fresh, unshared array object (internally, it allocates a new array and then calls `copyToArray`).
   
   It doesn't copy / clone the array elements themselves, but that shouldn't be a problem: by design, the objects inserted into this array are unshared / immutable. The map and array converters always return unshared objects, and we always `.copy()` rows when inserting them into a map or array parent container (this is still true after the changes in #26993).
   
   I did a bit of archaeology and tracked down the source of the `// NOTE` comment here: it was added in #7231, and at that time it looks like we were actually passing the `mutable.ArrayBuffer` itself to `updater`: https://github.com/apache/spark/blame/360fe18a61538b03cac05da1c6d258e124df6feb/sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystRowConverter.scala#L321. The comment makes sense in that context: with that older code, we would wind up with `Row()` objects that contained `mutable.ArrayBuffer`s.
   
   Later, in #7724, this was changed to pass a `new GenericArrayData(currentArray.toArray)` to the parent updater: https://github.com/apache/spark/commit/c0cc0eaec67208c087a30c1b1f50c00b2c1ebf08#diff-1d6c363c04155a9328fe1f5bd08a2f90. At that point I think we could have safely started reusing the `mutable.ArrayBuffer`, since it no longer escaped its converter.
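   For reference, a simplified, hypothetical sketch of the reuse pattern being discussed (the names `ReusableArrayCollector` and `setResult` are made up for illustration; the real converter also involves element converters, `ArrayData`, and the parent `updater`, all omitted here):
   
```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical, simplified stand-in for a repeated-value converter: one
// mutable buffer is reused across records, which is safe because end()
// hands the parent a fresh array produced by toArray.
class ReusableArrayCollector(setResult: Array[Any] => Unit) {
  private[this] val currentArray = ArrayBuffer.empty[Any]

  // start() just clears the reused buffer instead of allocating a new one.
  def start(): Unit = currentArray.clear()

  def addElement(value: Any): Unit = currentArray += value

  // toArray copies the buffer's contents into a new array, so later calls to
  // start()/addElement() can't mutate what was already passed to setResult.
  def end(): Unit = setResult(currentArray.toArray)
}
```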
