jonvex commented on code in PR #13987:
URL: https://github.com/apache/hudi/pull/13987#discussion_r2392576979
##########
hudi-common/src/main/java/org/apache/hudi/common/model/SerializableIndexedRecord.java:
##########
Review Comment:
Do we need the schema to be set here?
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/SparkDatasetMixin.scala:
##########
@@ -29,9 +29,17 @@ import scala.collection.JavaConverters._
trait SparkDatasetMixin {
def toDataset(spark: SparkSession, records: java.util.List[HoodieRecord[_]]) = {
- val avroRecords = records.asScala.map(
- _.getData
- .asInstanceOf[GenericRecord]
+ val record1 = records.get(0)
+ val isSerializableIndexedRecord = record1.getData.isInstanceOf[SerializableIndexedRecord]
Review Comment:
How much performance do we save by checking only the first record instead of
checking each record? We also need to handle an empty list so that
`records.get(0)` does not throw an index-out-of-bounds exception.
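A minimal sketch of the two options, written in Java rather than the Scala of the test mixin. `Payload`, `IndexedPayload`, `PlainPayload`, and `RecordTypeCheckSketch` are hypothetical stand-ins, not Hudi classes; the sketch only illustrates the empty-list guard and the cost difference between peeking at the first element and checking every element.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are not the Hudi classes.
interface Payload {}
final class IndexedPayload implements Payload {}
final class PlainPayload implements Payload {}

final class RecordTypeCheckSketch {

  // Peeks at the first element only. The isEmpty() guard returns false for an
  // empty list instead of letting get(0) throw IndexOutOfBoundsException.
  static boolean firstIsIndexed(List<? extends Payload> records) {
    return !records.isEmpty() && records.get(0) instanceof IndexedPayload;
  }

  // Checks every element: one instanceof test per record, but it does not
  // assume that the list is homogeneous.
  static boolean allIndexed(List<? extends Payload> records) {
    return !records.isEmpty()
        && records.stream().allMatch(r -> r instanceof IndexedPayload);
  }

  public static void main(String[] args) {
    List<Payload> empty = Collections.emptyList();
    List<Payload> mixed = Arrays.asList(new IndexedPayload(), new PlainPayload());

    System.out.println(firstIsIndexed(empty)); // prints false, no exception
    System.out.println(firstIsIndexed(mixed)); // prints true
    System.out.println(allIndexed(mixed));     // prints false
  }
}
```

The first-element check is constant time but silently misclassifies a mixed list, while the per-record check costs one type test per element. Whether that cost matters in a test helper is the open question above.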
##########
hudi-common/src/main/java/org/apache/hudi/common/util/collection/ExternalSpillableMap.java:
##########
@@ -239,6 +239,7 @@ public R put(T key, R value) {
this.inMemoryMap.put(key, value);
} else {
if (diskBasedMap == null) {
+ LOG.info("{} : Initializing disk based map as max memory threshold {} is reached", loggingContext, maxInMemorySizeInBytes);
Review Comment:
nit: you used `{},` above and then `{} :` here. Please choose one and use it consistently.
##########
hudi-common/src/main/java/org/apache/hudi/common/table/read/FileGroupReaderSchemaHandler.java:
##########
@@ -74,6 +74,10 @@ public class FileGroupReaderSchemaHandler<T> {
// requiredSchema: the requestedSchema with any additional columns required for merging etc
protected final Schema requiredSchema;
+ // the schema for updates, usually it equals with the requiredSchema,
+ // the only exception is for incoming records, which do not include the metadata fields.
+ protected Schema schemaForUpdates;
Review Comment:
We should consider whether this belongs in the schema handler. Currently all
of the fields in the schema handler are final. Would it be better to move
this to the reader context?
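One possible reading of this suggestion, sketched in Java with hypothetical names (`SchemaHandlerSketch`, `ReaderContextSketch`, and a `String` standing in for an Avro `Schema`); this is not the actual Hudi API. It assumes the handler keeps only final fields and the context owns the mutable override.

```java
// Hypothetical classes for illustration only; not the Hudi implementation.
final class SchemaHandlerSketch {
  // Assigned once in the constructor, like the handler's other final fields.
  private final String requiredSchema;

  SchemaHandlerSketch(String requiredSchema) {
    this.requiredSchema = requiredSchema;
  }

  String getRequiredSchema() {
    return requiredSchema;
  }
}

final class ReaderContextSketch {
  private final SchemaHandlerSketch handler;
  // Mutable override; null means "same as the required schema".
  private String schemaForUpdates;

  ReaderContextSketch(SchemaHandlerSketch handler) {
    this.handler = handler;
  }

  void setSchemaForUpdates(String schema) {
    this.schemaForUpdates = schema;
  }

  // Falls back to the handler's required schema when no override is set.
  String getSchemaForUpdates() {
    return schemaForUpdates != null ? schemaForUpdates : handler.getRequiredSchema();
  }

  public static void main(String[] args) {
    ReaderContextSketch ctx =
        new ReaderContextSketch(new SchemaHandlerSketch("required-with-meta"));
    System.out.println(ctx.getSchemaForUpdates()); // prints required-with-meta
    ctx.setSchemaForUpdates("incoming-without-meta");
    System.out.println(ctx.getSchemaForUpdates()); // prints incoming-without-meta
  }
}
```

With this split, the handler stays immutable and the "usually equals requiredSchema" default lives in one place.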