qjqqyy commented on issue #5107:
URL: https://github.com/apache/hudi/issues/5107#issuecomment-1076683168
On current git master, `HoodieSparkUtils::createRdd` effectively does the following (pseudocode):
```scala
df.mapPartitions { rows =>
  val convert: InternalRow => GenericRecord = { row =>
    // createAvroSerializer is called inside the per-row lambda,
    // so a new AvroSerializer is constructed for every single row
    sparkAdapter.createAvroSerializer(???).serialize(row)
  }
  // mapPartitions must return an Iterator, so convert is applied row by row
  rows.map(convert)
}
```
which appears to be why an `AvroSerializer` is created for every row.
I think the patch below will make the `AvroSerializer` be initialized only once
per partition; could you try it out?
```patch
diff --git a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
index 69005cd75..9c63295d6 100644
--- a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
+++ b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
@@ -76,7 +76,8 @@ object AvroConversionUtils {
    * @return converter accepting Catalyst payload (in the form of [[InternalRow]]) and transforming it into an Avro one
    */
   def createInternalRowToAvroConverter(rootCatalystType: StructType, rootAvroType: Schema, nullable: Boolean): InternalRow => GenericRecord = {
-    row => sparkAdapter.createAvroSerializer(rootCatalystType, rootAvroType, nullable)
+    val serializer = sparkAdapter.createAvroSerializer(rootCatalystType, rootAvroType, nullable)
+    row => serializer
       .serialize(row)
       .asInstanceOf[GenericRecord]
   }
```
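For reference, here is the same hoisting pattern in isolation. This is a minimal, self-contained sketch rather than Hudi code; `HoistingSketch` and `ExpensiveSerializer` are made-up names standing in for the real `AvroSerializer` setup. The point is simply that anything expensive to construct belongs in the `mapPartitions` closure body, outside the per-row lambda:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch (not Hudi code): hoist expensive per-row state out of the
// per-row lambda so it is built once per partition.
object HoistingSketch {

  // Hypothetical stand-in for AvroSerializer: imagine the constructor is
  // costly (schema resolution, codegen, ...) while serialize() is cheap.
  class ExpensiveSerializer extends Serializable {
    def serialize(n: Long): String = s"record-$n"
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("hoisting-sketch")
      .getOrCreate()

    val out = spark.sparkContext
      .parallelize(1L to 10L, numSlices = 2)
      .mapPartitions { rows =>
        // Constructed once per partition ...
        val serializer = new ExpensiveSerializer
        // ... and reused for every row; the iterator stays lazy.
        rows.map(serializer.serialize)
      }
      .collect()

    out.foreach(println)
    spark.stop()
  }
}
```

With the serializer hoisted, the per-row cost drops to a plain method call, which is the same effect the patch above has on `createInternalRowToAvroConverter`.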