Re: [PR] perf: Lazy deserialization of Avro indexed record [hudi]

via GitHub Sat, 18 Oct 2025 12:45:20 -0700


jonvex commented on code in PR #13987:
URL: https://github.com/apache/hudi/pull/13987#discussion_r2392576979



##########
hudi-common/src/main/java/org/apache/hudi/common/model/SerializableIndexedRecord.java:
##########


Review Comment:
   Do we need the schema to be set here?



##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/SparkDatasetMixin.scala:
##########
@@ -29,9 +29,17 @@ import scala.collection.JavaConverters._
 trait SparkDatasetMixin {
 
   def toDataset(spark: SparkSession, records: java.util.List[HoodieRecord[_]]) = {
-    val avroRecords = records.asScala.map(
-      _.getData
-        .asInstanceOf[GenericRecord]
+    val record1 = records.get(0)
+    val isSerializableIndexedRecord = record1.getData.isInstanceOf[SerializableIndexedRecord]

Review Comment:
   How much performance do we save by checking only the first record instead of
   checking each record? We also need to handle an empty list so that
   `records.get(0)` does not throw an index-out-of-bounds exception.
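
   A minimal sketch of the per-record alternative, in Java for illustration. The
   names `RecordUnwrapper`, `Payload`, `PlainRecord`, `LazyRecord`, and `decode()`
   are hypothetical stand-ins, not Hudi or Avro types. Because the loop never
   calls `get(0)`, an empty list simply yields an empty result, and the
   per-element `instanceof` check is likely cheap next to the decoding itself:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins only; none of these are real Hudi or Avro classes.
final class RecordUnwrapper {
  private RecordUnwrapper() {}

  interface Payload {}

  static final class PlainRecord implements Payload {}

  static final class LazyRecord implements Payload {
    // Stand-in for whatever decoding the real wrapper needs before use.
    PlainRecord decode() {
      return new PlainRecord();
    }
  }

  // Checks every element instead of sampling the first one, so an empty
  // list never triggers get(0) and a mixed list is handled element by element.
  static List<PlainRecord> unwrapAll(List<? extends Payload> payloads) {
    List<PlainRecord> out = new ArrayList<>(payloads.size());
    for (Payload p : payloads) {
      if (p instanceof LazyRecord) {
        out.add(((LazyRecord) p).decode());
      } else {
        out.add((PlainRecord) p);
      }
    }
    return out;
  }
}
```

   If the first-record shortcut is kept instead, an `isEmpty` guard before
   `get(0)` would cover the empty case.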



##########
hudi-common/src/main/java/org/apache/hudi/common/util/collection/ExternalSpillableMap.java:
##########
@@ -239,6 +239,7 @@ public R put(T key, R value) {
       this.inMemoryMap.put(key, value);
     } else {
       if (diskBasedMap == null) {
+        LOG.info("{} : Initializing disk based map as max memory threshold {} is reached", loggingContext, maxInMemorySizeInBytes);

Review Comment:
   nit: you used `{},` above and then `{} :` here. Choose one.



##########
hudi-common/src/main/java/org/apache/hudi/common/table/read/FileGroupReaderSchemaHandler.java:
##########
@@ -74,6 +74,10 @@ public class FileGroupReaderSchemaHandler<T> {
   // requiredSchema: the requestedSchema with any additional columns required for merging etc
   protected final Schema requiredSchema;
 
+  // the schema for updates, usually it equals with the requiredSchema,
+  // the only exception is for incoming records, which do not include the metadata fields.
+  protected Schema schemaForUpdates;

Review Comment:
   We should think about whether we want this in the schema handler. Currently,
   all the fields in the schema handler are final. Would it be better to move
   this to the reader context?
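
   A rough sketch of what that could look like, in Java. `SchemaPlacementSketch`,
   `Handler`, and `Context` are hypothetical stand-ins for the schema handler and
   reader context, and a `String` stands in for the Avro `Schema` so the snippet
   compiles on its own. The handler keeps only final fields, while the context
   owns the mutable value and falls back to the required schema when nothing has
   been set:

```java
import java.util.Objects;

// Hypothetical stand-ins only; not the actual Hudi classes.
final class SchemaPlacementSketch {
  private SchemaPlacementSketch() {}

  // Stand-in for the schema handler: every field stays final.
  static final class Handler {
    private final String requiredSchema;

    Handler(String requiredSchema) {
      this.requiredSchema = Objects.requireNonNull(requiredSchema);
    }

    String requiredSchema() {
      return requiredSchema;
    }
  }

  // Stand-in for the reader context, which owns the mutable schema.
  static final class Context {
    private String schemaForUpdates; // null until explicitly set

    void setSchemaForUpdates(String schema) {
      this.schemaForUpdates = schema;
    }

    // Falls back to the handler's required schema when no override was set.
    String schemaForUpdates(Handler handler) {
      return schemaForUpdates != null ? schemaForUpdates : handler.requiredSchema();
    }
  }
}
```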



