qjqqyy commented on issue #5107:
URL: https://github.com/apache/hudi/issues/5107#issuecomment-1076683168
On current git master, `HoodieSparkUtils::createRdd` effectively does the following (pseudocode):
```scala
df.mapPartitions { rows =>
  val convert: InternalRow => GenericRecord = { row =>
    // createAvroSerializer is called inside the per-row lambda,
    // so a new AvroSerializer is constructed for every single row
    sparkAdapter.createAvroSerializer(???).serialize(row)
  }
  // mapPartitions must return an Iterator, so convert is applied row by row
  rows.map(convert)
}
```
which appears to be why an `AvroSerializer` is created for every row.
I think the patch below will make the `AvroSerializer` be initialized only once
per partition; could you try it out?
```patch
diff --git a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
index 69005cd75..9c63295d6 100644
--- a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
+++ b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
@@ -76,7 +76,8 @@ object AvroConversionUtils {
    * @return converter accepting Catalyst payload (in the form of [[InternalRow]]) and transforming it into an Avro one
    */
   def createInternalRowToAvroConverter(rootCatalystType: StructType, rootAvroType: Schema, nullable: Boolean): InternalRow => GenericRecord = {
-    row => sparkAdapter.createAvroSerializer(rootCatalystType, rootAvroType, nullable)
+    val serializer = sparkAdapter.createAvroSerializer(rootCatalystType, rootAvroType, nullable)
+    row => serializer
       .serialize(row)
       .asInstanceOf[GenericRecord]
   }
```
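For reference, here is the same hoisting pattern in isolation. This is a minimal, self-contained sketch rather than Hudi code; `HoistingSketch` and `ExpensiveSerializer` are made-up names standing in for the real `AvroSerializer` setup. The point is simply that anything expensive to construct belongs in the `mapPartitions` closure body, outside the per-row lambda:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch (not Hudi code): hoist expensive per-row state out of the
// per-row lambda so it is built once per partition.
object HoistingSketch {

  // Hypothetical stand-in for AvroSerializer: imagine the constructor is
  // costly (schema resolution, codegen, ...) while serialize() is cheap.
  class ExpensiveSerializer extends Serializable {
    def serialize(n: Long): String = s"record-$n"
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("hoisting-sketch")
      .getOrCreate()

    val out = spark.sparkContext
      .parallelize(1L to 10L, numSlices = 2)
      .mapPartitions { rows =>
        // Constructed once per partition ...
        val serializer = new ExpensiveSerializer
        // ... and reused for every row; the iterator stays lazy.
        rows.map(serializer.serialize)
      }
      .collect()

    out.foreach(println)
    spark.stop()
  }
}
```

With the serializer hoisted, the per-row cost drops to a plain method call, which is the same effect the patch above has on `createInternalRowToAvroConverter`.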