danny0405 commented on code in PR #13862:
URL: https://github.com/apache/hudi/pull/13862#discussion_r2331690278
##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieAvroDataBlock.java:
##########
@@ -108,24 +113,32 @@ protected ByteArrayOutputStream serializeRecords(List<HoodieRecord> records, Hoo
// 1. Write out the log block version
output.writeInt(HoodieLogBlock.version);
- // 2. Write total number of records
- output.writeInt(records.size());
-
- // 3. Write the records
+ // 2. Pre-serialize records to handle and get accurate count
Properties props = initProperties(storage.getConf());
+ List<ByteArrayOutputStream> serializedRecords = new ArrayList<>();
for (HoodieRecord<?> s : records) {
try {
// Encode the record into bytes
// Spark Record not support write avro log
ByteArrayOutputStream data = s.getAvroBytes(schema, props);
- // Write the record size
- output.writeInt(data.size());
- // Write the content
- data.writeTo(output);
+ serializedRecords.add(data);
} catch (IOException e) {
throw new HoodieIOException("IOException converting HoodieAvroDataBlock to bytes", e);
+ } catch (Exception e) {
+ LOG.warn("Skipping record during serialization: {}. This may be due to concurrent archiving race conditions. "
Review Comment:
We should not change the log block serialization to silently skip records. If
the failure is caused by concurrent modification of the timeline, can you add
explicit lock providers in the test?
We still need to investigate which field in `HoodieArchivedMetaEntry` is
null and what exactly the culprit is.
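
For reference, a minimal sketch of what "explicit lock providers in the test" could look like as write properties. The class and method names (`LockProviderTestConfig`, `lockedWriteProps`) are hypothetical and not part of this PR; the keys and the `InProcessLockProvider` class name are the commonly documented Hudi multi-writer settings and should be checked against the Hudi version under test.

```java
import java.util.Properties;

// Hypothetical helper, not part of the PR: collects the write properties a
// test might pass to its writer so that timeline changes are made under a lock.
public class LockProviderTestConfig {

  // Assumed keys and values, taken from Hudi's documented concurrency-control
  // settings; verify them against the Hudi version used by the test.
  public static Properties lockedWriteProps() {
    Properties props = new Properties();
    // Enable optimistic concurrency control so writers and table services coordinate.
    props.setProperty("hoodie.write.concurrency.mode", "optimistic_concurrency_control");
    // In-process lock provider, suitable for single-JVM tests.
    props.setProperty("hoodie.write.lock.provider",
        "org.apache.hudi.client.transaction.lock.InProcessLockProvider");
    // Lazy cleaning of failed writes is the setting usually paired with multi-writer mode.
    props.setProperty("hoodie.cleaner.policy.failed.writes", "LAZY");
    return props;
  }

  public static void main(String[] args) {
    // Print the properties so they can be compared with the test's write config.
    lockedWriteProps().forEach((k, v) -> System.out.println(k + "=" + v));
  }
}
```

If the null field still shows up with a lock provider configured, the race is likely not the root cause, and the serialization path should keep failing loudly rather than skipping records.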