n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r378427516
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieCommitArchiveLog.java
 ##########
 @@ -268,6 +270,19 @@ public Path getArchiveFilePath() {
     return archiveFilePath;
   }
 
+  private void writeHeaderBlock(Schema wrapperSchema, List<HoodieInstant> instants) throws Exception {
+    if (!instants.isEmpty()) {
+      Collections.sort(instants, HoodieInstant.COMPARATOR);
+      HoodieInstant minInstant = instants.get(0);
+      HoodieInstant maxInstant = instants.get(instants.size() - 1);
+      Map<HeaderMetadataType, String> metadataMap = Maps.newHashMap();
+      metadataMap.put(HeaderMetadataType.SCHEMA, wrapperSchema.toString());
+      metadataMap.put(HeaderMetadataType.MIN_INSTANT_TIME, minInstant.getTimestamp());
+      metadataMap.put(HeaderMetadataType.MAX_INSTANT_TIME, maxInstant.getTimestamp());
+      this.writer.appendBlock(new HoodieAvroDataBlock(Collections.emptyList(), metadataMap));
+    }
+  }
+
   private void writeToFile(Schema wrapperSchema, List<IndexedRecord> records) throws Exception {
 
 Review comment:
   Move the header writing into this method. Instead of adding a separate
block, augment the same DataBlock that holds the archived records with the
metadata you want to add. We already write the schema to the headers here, so
just add more entries (like the ones above). You will then be able to read each
block and filter on whether that block should be considered. This is more
generic than adding an extra empty log block to track min/max over the entire
file, which is hard since the file keeps growing anyway.
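
   To illustrate the suggestion, here is a minimal, self-contained sketch of
per-block min/max headers and the reader-side filter. It does not use Hudi's
classes: `Block`, `buildBlock`, `filterBlocks`, and the string header keys are
hypothetical stand-ins for the data block and `HeaderMetadataType` entries in
the diff above, and it assumes instant timestamps are fixed-width strings that
order correctly under plain string comparison.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BlockHeaderSketch {

  // Hypothetical stand-in for a log data block: header entries plus the
  // instant times of the archived records it carries.
  public static final class Block {
    public final Map<String, String> header;
    public final List<String> instantTimes;

    Block(Map<String, String> header, List<String> instantTimes) {
      this.header = header;
      this.instantTimes = instantTimes;
    }
  }

  // Build ONE block that holds the records and also records, in its own
  // header, the schema and the min/max instant time of just those records.
  public static Block buildBlock(String schema, List<String> instantTimes) {
    if (instantTimes.isEmpty()) {
      throw new IllegalArgumentException("nothing to archive");
    }
    Map<String, String> header = new HashMap<>();
    header.put("SCHEMA", schema);
    // Assumption: fixed-width timestamps, so string order == time order.
    header.put("MIN_INSTANT_TIME", Collections.min(instantTimes));
    header.put("MAX_INSTANT_TIME", Collections.max(instantTimes));
    return new Block(header, new ArrayList<>(instantTimes));
  }

  // Reader side: keep only blocks whose [min, max] range overlaps the
  // requested [start, end] range (both bounds inclusive).
  public static List<Block> filterBlocks(List<Block> blocks, String start, String end) {
    List<Block> kept = new ArrayList<>();
    for (Block block : blocks) {
      String min = block.header.get("MIN_INSTANT_TIME");
      String max = block.header.get("MAX_INSTANT_TIME");
      if (max.compareTo(start) >= 0 && min.compareTo(end) <= 0) {
        kept.add(block);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<Block> file = new ArrayList<>();
    // Each archival run appends one self-describing block to the file.
    file.add(buildBlock("schema-json", List.of("20200103000000", "20200101000000")));
    file.add(buildBlock("schema-json", List.of("20200201000000", "20200205000000")));

    List<Block> hits = filterBlocks(file, "20200102000000", "20200110000000");
    System.out.println("blocks kept: " + hits.size());
  }
}
```

Because each block describes only its own records, appending a new block never
invalidates a header written earlier, which is the problem a single file-level
min/max would run into as the archive file grows.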
