vinothchandar commented on a change in pull request #4743:
URL: https://github.com/apache/hudi/pull/4743#discussion_r808143534
##########
File path:
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieMergeOnReadTableInputFormat.java
##########
@@ -235,20 +258,117 @@ protected FileSplit makeSplit(Path file, long start, long length, String[] hosts
     return result;
   }

-  private FileSplit doMakeSplitForPathWithLogFilePath(PathWithLogFilePath path, long start, long length, String[] hosts, String[] inMemoryHosts) {
-    if (!path.includeBootstrapFilePath()) {
-      return path.buildSplit(path, start, length, hosts);
-    } else {
+  private FileSplit doMakeSplitForRealtimePath(HoodieRealtimePath path, long start, long length, String[] hosts, String[] inMemoryHosts) {
+    if (path.includeBootstrapFilePath()) {
       FileSplit bf =
           inMemoryHosts == null
               ? super.makeSplit(path.getPathWithBootstrapFileStatus(), start, length, hosts)
               : super.makeSplit(path.getPathWithBootstrapFileStatus(), start, length, hosts, inMemoryHosts);
-      return HoodieRealtimeInputFormatUtils.createRealtimeBoostrapBaseFileSplit(
+      return createRealtimeBoostrapBaseFileSplit(
           (BootstrapBaseFileSplit) bf,
           path.getBasePath(),
           path.getDeltaLogFiles(),
           path.getMaxCommitTime(),
-          path.getBelongsToIncrementalQuery());
+          path.getBelongsToIncrementalQuery(),
+          path.getVirtualKeyInfo()
+      );
+    }
+
+    return createRealtimeFileSplit(path, start, length, hosts);
+  }
+
+  private static boolean containsIncrementalQuerySplits(List<FileSplit> fileSplits) {
Review comment:
       Are these methods just moved over/consolidated?
##########
File path:
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
##########
@@ -147,18 +148,21 @@ public WriteOperationType getOperationType() {
    * been touched multiple times in the given commits, the return value will keep the one
    * from the latest commit.
    *
+   *
+   * @param hadoopConf
    * @param basePath The base path
    * @return the file full path to file status mapping
    */
-  public Map<String, FileStatus> getFullPathToFileStatus(String basePath) {
+  public Map<String, FileStatus> getFullPathToFileStatus(Configuration hadoopConf, String basePath) {
     Map<String, FileStatus> fullPathToFileStatus = new HashMap<>();
     for (List<HoodieWriteStat> stats : getPartitionToWriteStats().values()) {
       // Iterate through all the written files.
       for (HoodieWriteStat stat : stats) {
         String relativeFilePath = stat.getPath();
         Path fullPath = relativeFilePath != null ? FSUtils.getPartitionPath(basePath, relativeFilePath) : null;
         if (fullPath != null) {
-          FileStatus fileStatus = new FileStatus(stat.getFileSizeInBytes(), false, 0, 0,
+          long blockSize = FSUtils.getFs(fullPath.toString(), hadoopConf).getDefaultBlockSize(fullPath);
Review comment:
       Block size is a per-file attribute in HDFS, though it is rarely set differently per file; for cloud storage it is all 0s. We need to check whether `getDefaultBlockSize()` is an RPC call. If it is, we need to avoid it here, even if that means making assumptions.
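       If `getDefaultBlockSize()` does turn out to issue an RPC, one possible mitigation (a sketch only, not Hudi's actual code; `BlockSizeCache` and the loader function are hypothetical names) would be to memoize the value once per filesystem rather than calling it once per written file in the commit metadata:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToLongFunction;

// Hypothetical helper: caches the default block size per filesystem key
// (e.g. scheme + authority), so a potentially remote lookup runs once
// per filesystem instead of once per file in the loop above.
public class BlockSizeCache {
  private final ConcurrentHashMap<String, Long> cache = new ConcurrentHashMap<>();
  // Stands in for fs.getDefaultBlockSize(path); injected so the cache is
  // independent of any concrete FileSystem implementation.
  private final ToLongFunction<String> loader;

  public BlockSizeCache(ToLongFunction<String> loader) {
    this.loader = loader;
  }

  public long get(String fsKey) {
    // computeIfAbsent invokes the loader at most once per key.
    return cache.computeIfAbsent(fsKey, loader::applyAsLong);
  }
}
```

       This trades per-file accuracy for one lookup per filesystem, which matches the observation that block size is rarely set differently per file.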
##########
File path:
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieCopyOnWriteTableInputFormat.java
##########
@@ -73,20 +74,7 @@
  *
  * NOTE: This class is invariant of the underlying file-format of the files being read
  */
-public class HoodieCopyOnWriteTableInputFormat extends FileInputFormat<NullWritable, ArrayWritable>
-    implements Configurable {
-
-  protected Configuration conf;
-
-  @Override
-  public final Configuration getConf() {
-    return conf;
-  }
-
-  @Override
-  public final void setConf(Configuration conf) {
Review comment:
       I am blanking now, but it is worth tracing why we needed these, i.e. whether there are any other Hive gotchas around combine input format etc.
##########
File path:
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/RealtimeFileStatus.java
##########
@@ -34,51 +35,62 @@
* in Path.
*/
public class RealtimeFileStatus extends FileStatus {
Review comment:
       "Realtime" is a remnant of the original design motivation for MOR snapshot queries. We might even revive the name after the caching story lands.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]