nsivabalan commented on code in PR #9367:
URL: https://github.com/apache/hudi/pull/9367#discussion_r1285119538


##########
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieBaseFile.java:
##########
@@ -61,16 +62,39 @@ public HoodieBaseFile(String filePath) {
   public HoodieBaseFile(String filePath, BaseFile bootstrapBaseFile) {
     super(filePath);
     this.bootstrapBaseFile = Option.ofNullable(bootstrapBaseFile);
-    String[] fileIdAndCommitTime = getFileIdAndCommitTimeFromFileName();
+    String[] fileIdAndCommitTime = 
getFileIdAndCommitTimeFromFileName(getFileName());
     this.fileId = fileIdAndCommitTime[0];
     this.commitTime = fileIdAndCommitTime[1];
   }
 
+  public HoodieBaseFile(String filePath, String fileId, String commitTime, 
BaseFile bootstrapBaseFile) {
+    super(filePath);
+    this.bootstrapBaseFile = Option.ofNullable(bootstrapBaseFile);
+    this.fileId = fileId;
+    this.commitTime = commitTime;
+  }
+
+  private HoodieBaseFile(FileStatus fileStatus, String[] fileIdAndCommitTime, 
BaseFile bootstrapBaseFile) {
+    this(fileStatus, fileIdAndCommitTime[0], fileIdAndCommitTime[1], 
bootstrapBaseFile);
+  }
+
+  public HoodieBaseFile(FileStatus fileStatus, String fileId, String 
commitTime, BaseFile bootstrapBaseFile) {
+    super(handleExternallyGeneratedFileName(fileStatus, fileId));

Review Comment:
   the naming kind of threw me of. 
   can we name is `mayBeHandleExternallyGeneratedFiles` instead of 
`handleExternallyGeneratedFileName`



##########
hudi-common/src/main/java/org/apache/hudi/common/util/ExternalFilePathUtil.java:
##########
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+/**
+ * Utility methods for handling externally created files.
+ */
+public class ExternalFilePathUtil {
+  // Suffix acts as a marker when appended to a file path that the path was 
created by an external system and not a Hudi writer.
+  private static final String EXTERNAL_FILE_SUFFIX = "_hudiext";

Review Comment:
   oh, we are changing it to suffix is it. whats the rationale? 



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java:
##########
@@ -324,33 +325,45 @@ public static HoodieRecord<HoodieMetadataPayload> 
createPartitionListRecord(List
    * @param partition    The name of the partition
    * @param filesAdded   Mapping of files to their sizes for files which have 
been added to this partition
    * @param filesDeleted List of files which have been deleted from this 
partition
+   * @param instantTime  Commit time of the commit responsible for adding 
and/or deleting these files, will be empty during bootstrapping of the metadata 
table
    */
   public static HoodieRecord<HoodieMetadataPayload> 
createPartitionFilesRecord(String partition,
-                                                                               
Option<Map<String, Long>> filesAdded,
-                                                                               
Option<List<String>> filesDeleted) {
-    Map<String, HoodieMetadataFileInfo> fileInfo = new HashMap<>();
-    filesAdded.ifPresent(filesMap ->
-        fileInfo.putAll(
-            filesMap.entrySet().stream().collect(
-                Collectors.toMap(Map.Entry::getKey, (entry) -> {
-                  long fileSize = entry.getValue();
-                  // Assert that the file-size of the file being added is 
positive, since Hudi
-                  // should not be creating empty files
-                  checkState(fileSize > 0);
-                  return new HoodieMetadataFileInfo(fileSize, false);
-                })))
-    );
-    filesDeleted.ifPresent(filesList ->
-        fileInfo.putAll(
-            filesList.stream().collect(
-                Collectors.toMap(Function.identity(), (ignored) -> new 
HoodieMetadataFileInfo(0L, true))))
-    );
+                                                                               
Map<String, Long> filesAdded,
+                                                                               
List<String> filesDeleted,
+                                                                               
Option<String> instantTime) {
+    int size = filesAdded.size() + filesDeleted.size();
+    Map<String, HoodieMetadataFileInfo> fileInfo = new HashMap<>(size, 1);
+    filesAdded.forEach((fileName, fileSize) -> {
+      // Assert that the file-size of the file being added is positive, since 
Hudi
+      // should not be creating empty files
+      checkState(fileSize > 0);
+      fileInfo.put(handleFileName(fileName, instantTime), new 
HoodieMetadataFileInfo(fileSize, false));
+    });
+
+    filesDeleted.forEach(fileName -> fileInfo.put(handleFileName(fileName, 
instantTime), DELETE_FILE_METADATA));
 
     HoodieKey key = new HoodieKey(partition, 
MetadataPartitionType.FILES.getPartitionPath());
     HoodieMetadataPayload payload = new 
HoodieMetadataPayload(key.getRecordKey(), METADATA_TYPE_FILE_LIST, fileInfo);
     return new HoodieAvroRecord<>(key, payload);
   }
 
+  /**
+   * In the case where a file was created by something other than a Hudi 
writer, the file name will not contain the commit time. We will prefix the file 
name with hudiext_[commitTime] before storing
+   * in the metadata table. The constructor for {@link 
org.apache.hudi.common.model.HoodieBaseFile} will properly handle this prefix.
+   * @param fileName incoming file name
+   * @param commitTime time of the commit (will be empty during bootstrap 
operations)
+   * @return file name with commit time prefix if the input file name does not 
contain the commit time, otherwise returns the original input
+   */
+  private static String handleFileName(String fileName, Option<String> 
commitTime) {

Review Comment:
   yeah. thats better. anyways, we are the ones who will construct the 
WriteStatus right. 



##########
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieBaseFile.java:
##########
@@ -61,16 +62,39 @@ public HoodieBaseFile(String filePath) {
   public HoodieBaseFile(String filePath, BaseFile bootstrapBaseFile) {
     super(filePath);
     this.bootstrapBaseFile = Option.ofNullable(bootstrapBaseFile);
-    String[] fileIdAndCommitTime = getFileIdAndCommitTimeFromFileName();
+    String[] fileIdAndCommitTime = 
getFileIdAndCommitTimeFromFileName(getFileName());
     this.fileId = fileIdAndCommitTime[0];
     this.commitTime = fileIdAndCommitTime[1];
   }
 
+  public HoodieBaseFile(String filePath, String fileId, String commitTime, 
BaseFile bootstrapBaseFile) {
+    super(filePath);
+    this.bootstrapBaseFile = Option.ofNullable(bootstrapBaseFile);
+    this.fileId = fileId;
+    this.commitTime = commitTime;
+  }
+
+  private HoodieBaseFile(FileStatus fileStatus, String[] fileIdAndCommitTime, 
BaseFile bootstrapBaseFile) {
+    this(fileStatus, fileIdAndCommitTime[0], fileIdAndCommitTime[1], 
bootstrapBaseFile);
+  }
+
+  public HoodieBaseFile(FileStatus fileStatus, String fileId, String 
commitTime, BaseFile bootstrapBaseFile) {
+    super(handleExternallyGeneratedFileName(fileStatus, fileId));

Review Comment:
   got it. should be ok. don't think we need to validation for every base file 
creation. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to