umehrot2 commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554237558



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -221,8 +221,9 @@ public HoodieBackedTableMetadata metadata() {
   protected abstract void initialize(HoodieEngineContext engineContext, HoodieTableMetaClient datasetMetaClient);
 
   private void initTableMetadata() {
-    this.metadata = new HoodieBackedTableMetadata(hadoopConf.get(), datasetWriteConfig.getBasePath(), datasetWriteConfig.getSpillableMapBasePath(),
-        datasetWriteConfig.useFileListingMetadata(), datasetWriteConfig.getFileListingMetadataVerify(), false,
+    this.metadata = new HoodieBackedTableMetadata(engineContext, hadoopConf.get(), datasetWriteConfig.getBasePath(),

Review comment:
       Will do.

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java
##########
@@ -39,6 +39,7 @@
   // Validate contents of Metadata Table on each access against the actual filesystem
   public static final String METADATA_VALIDATE_PROP = METADATA_PREFIX + ".validate";
   public static final boolean DEFAULT_METADATA_VALIDATE = false;
+  public static final boolean DEFAULT_METADATA_ENABLE_FOR_READERS = false;

Review comment:
       This is already next to `METADATA_VALIDATE_PROP`, for which this is the default (for readers).

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieMREngineContext.java
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.engine;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.function.SerializableConsumer;
+import org.apache.hudi.common.function.SerializableFunction;
+import org.apache.hudi.common.function.SerializablePairFunction;
+import org.apache.hudi.common.util.Option;
+
+import org.apache.hudi.common.util.collection.Pair;
+
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static java.util.stream.Collectors.toList;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingFlatMapWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingForeachWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingMapToPairWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingMapWrapper;
+
+/**
+ * A Java-based engine context that can be used from map-reduce tasks executing in query engines like
+ * Spark, Hive and Presto.
+ */
+public final class HoodieMREngineContext extends HoodieEngineContext {

Review comment:
       Well, it is being directly instantiated in a couple of places in `hudi-common`:
   1. `FileSystemViewManager`: https://github.com/apache/hudi/pull/2417/files#diff-1fa21ae97e976db303c47571e7cb9c34e4f94719d482e537134e231b2e944388R167
   2. `HoodieBackedTableMetadata`: https://github.com/apache/hudi/pull/2417/files#diff-7c43aea81a02b4f135452b50eaa36d5868081e72b37d43101ca9de1f9ebb5195R77
   
   `hudi-common` does not depend on `hudi-hadoop-mr`. As per your other comment, we can get rid of `1` if we pass the engine context from the caller side. But we should not get rid of `2`, because we are treating `HoodieEngineContext` as the default engine context to instantiate. This constructor is used from `hudi-cli`, for example (which does not depend on `hudi-hadoop-mr`):
   - https://github.com/apache/hudi/pull/2417/files#diff-b9829aa53bc2a7b0a655993df2c3c75d53f6a8e51f317753e51f27c63fc20341R165
   - https://github.com/apache/hudi/pull/2417/files#diff-b9829aa53bc2a7b0a655993df2c3c75d53f6a8e51f317753e51f27c63fc20341R197
   
   So, if we want to treat it as the default engine context, it would be better to keep it in `hudi-common`.
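   For reference, a minimal sketch of what a plain-Java engine context along these lines could look like, built on the stream wrappers imported above. The exact abstract signatures on `HoodieEngineContext` are my assumption here, and `map`/`foreach` are only two of the methods such a class would override:

```java
// A hedged sketch, not the PR's exact code: a Spark/Flink-free engine context
// that executes transformations with plain Java parallel streams, so it can
// live in hudi-common and be instantiated from hudi-cli or MR tasks alike.
public final class HoodieLocalEngineContext extends HoodieEngineContext {

  public HoodieLocalEngineContext(Configuration conf) {
    super(new SerializableConfiguration(conf)); // assumed base-class constructor
  }

  @Override
  public <I, O> List<O> map(List<I> data, SerializableFunction<I, O> func, int parallelism) {
    // Parallel streams stand in for a distributed map; the parallelism hint is
    // effectively bounded by the common ForkJoinPool in this local variant.
    return data.stream().parallel().map(throwingMapWrapper(func)).collect(toList());
  }

  @Override
  public <I> void foreach(List<I> data, SerializableConsumer<I> consumer, int parallelism) {
    data.stream().parallel().forEach(throwingForeachWrapper(consumer));
  }
}
```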

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -49,12 +60,48 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public List<String> getAllPartitionPaths() throws IOException {
-    FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
     if (assumeDatePartitioning) {
+      FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
       return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath);
-    } else {
-      return FSUtils.getAllFoldersWithPartitionMetaFile(fs, datasetBasePath);
     }
+
+    List<Path> pathsToList = new LinkedList<>();
+    pathsToList.add(new Path(datasetBasePath));
+    List<String> partitionPaths = new ArrayList<>();
+
+    // TODO: Get the parallelism from HoodieWriteConfig
+    final int fileListingParallelism = 1500;

Review comment:
       Yeah, 1500 is just the maximum. The only downside of the current code is that one cannot increase the parallelism beyond 1500. These classes already have too many parameters, and adding a new one here means changing all the consumers (and their consumers), which in turn touches more and more files.
   
   I think it would be better, in a separate PR, to get rid of these individual parameters and have the consumers pass in `HoodieMetadataConfig` (which has all these params); a rough sketch follows.
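   As a sketch of that follow-up refactor (the constructor shape and field set are hypothetical; only `HoodieMetadataConfig` itself is real):

```java
// Consumers hand over the whole config object instead of individual flags,
// so adding a new knob (e.g. file-listing parallelism) no longer changes
// every constructor in the call chain.
public class FileSystemBackedTableMetadata {
  private final HoodieEngineContext engineContext;
  private final SerializableConfiguration hadoopConf;
  private final String datasetBasePath;
  private final HoodieMetadataConfig metadataConfig;

  public FileSystemBackedTableMetadata(HoodieEngineContext engineContext,
      SerializableConfiguration conf, String datasetBasePath,
      HoodieMetadataConfig metadataConfig) {
    this.engineContext = engineContext;
    this.hadoopConf = conf;
    this.datasetBasePath = datasetBasePath;
    this.metadataConfig = metadataConfig;
  }
}
```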

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -49,12 +60,48 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public List<String> getAllPartitionPaths() throws IOException {
-    FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
     if (assumeDatePartitioning) {
+      FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
       return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath);
-    } else {
-      return FSUtils.getAllFoldersWithPartitionMetaFile(fs, datasetBasePath);
     }
+
+    List<Path> pathsToList = new LinkedList<>();

Review comment:
       Yeah, I adapted the same code with some changes that optimize for this particular case of fetching only the partition paths.
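   For readers following along, here is a hedged sketch of that breadth-first listing, assuming the fields from the diff above (`datasetBasePath`, `hadoopConf`, `engineContext`, `fileListingParallelism`); this is my best reading of the approach, not the PR's exact code:

```java
List<Path> pathsToList = new LinkedList<>();
pathsToList.add(new Path(datasetBasePath));
List<String> partitionPaths = new ArrayList<>();

while (!pathsToList.isEmpty()) {
  // Never ask for more parallelism than there are directories pending.
  int listingParallelism = Math.min(fileListingParallelism, pathsToList.size());

  // List one directory level per round, in parallel, via the engine context.
  List<FileStatus[]> dirToFileListing = engineContext.map(new ArrayList<>(pathsToList),
      path -> path.getFileSystem(hadoopConf.get()).listStatus(path), listingParallelism);
  pathsToList.clear();

  for (FileStatus[] statuses : dirToFileListing) {
    boolean isPartition = Arrays.stream(statuses).anyMatch(status ->
        status.getPath().getName().equals(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE));
    if (isPartition) {
      // A folder holding .hoodie_partition_metadata is a partition; record it
      // relative to the base path and do not descend any further.
      Path partitionPath = statuses[0].getPath().getParent();
      partitionPaths.add(FSUtils.getRelativePartitionPath(new Path(datasetBasePath), partitionPath));
    } else {
      // Otherwise queue the sub-directories for the next listing round.
      // (The real code also needs to skip the .hoodie folder; omitted here.)
      Arrays.stream(statuses).filter(FileStatus::isDirectory)
          .forEach(status -> pathsToList.add(status.getPath()));
    }
  }
}
return partitionPaths;
```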

##########
File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java
##########
@@ -82,7 +82,7 @@
       HoodieTableMetaClient metaClient = partitionsToMetaClient.get(partitionPath);
       if (!fsCache.containsKey(metaClient)) {
 
-        HoodieTableFileSystemView fsView = FileSystemViewManager.createInMemoryFileSystemView(metaClient,
+        HoodieTableFileSystemView fsView = FileSystemViewManager.createInMemoryFileSystemViewForReaders(metaClient,

Review comment:
       Will do.

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java
##########
@@ -159,12 +160,12 @@ private static HoodieTableFileSystemView createInMemoryFileSystemView(Serializab
     return new HoodieTableFileSystemView(metaClient, timeline, viewConf.isIncrementalTimelineSyncEnabled());
   }
 
-  public static HoodieTableFileSystemView createInMemoryFileSystemView(HoodieTableMetaClient metaClient,
-                                                                       boolean useFileListingFromMetadata,
-                                                                       boolean verifyListings) {
+  public static HoodieTableFileSystemView createInMemoryFileSystemViewForReaders(HoodieTableMetaClient metaClient,
+      boolean useFileListingFromMetadata, boolean verifyListings) {

Review comment:
       Will do.

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -64,6 +111,6 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public boolean isInSync() {
-    throw new UnsupportedOperationException();
+    return false;

Review comment:
       I guess. But setting it to true breaks https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/metadata/TestHoodieBackedMetadata.java#L456 and others. I will look at the tests again to see if those need to be fixed instead. I don't really see `isInSync()` being used anywhere in the source code (apart from tests), except for being stored in stats somewhere. I wonder if it's really needed.

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java
##########
@@ -159,12 +160,12 @@ private static HoodieTableFileSystemView createInMemoryFileSystemView(Serializab
     return new HoodieTableFileSystemView(metaClient, timeline, viewConf.isIncrementalTimelineSyncEnabled());
   }
 
-  public static HoodieTableFileSystemView createInMemoryFileSystemView(HoodieTableMetaClient metaClient,
-                                                                       boolean useFileListingFromMetadata,
-                                                                       boolean verifyListings) {
+  public static HoodieTableFileSystemView createInMemoryFileSystemViewForReaders(HoodieTableMetaClient metaClient,
+      boolean useFileListingFromMetadata, boolean verifyListings) {
     LOG.info("Creating InMemory based view for basePath " + metaClient.getBasePath());
     if (useFileListingFromMetadata) {
-      return new HoodieMetadataFileSystemView(metaClient,
+      HoodieMREngineContext engineContext = new HoodieMREngineContext(metaClient.getHadoopConf());

Review comment:
       Hudi has added support for other engines like Flink. Also, `HoodieSparkEngineContext` cannot be used here because this API is called inside individual map-reduce tasks from engines like Spark and Hive. In the case of Spark, the individual tasks do not have access to the Spark context, since the Spark context is not serializable.
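   To make the serializability constraint concrete, a hedged sketch of a task-side call site (`jsc`, `partitionPaths` and `parallelism` are assumed to be in scope; the `HoodieMREngineContext(Configuration)` constructor is the one used in the diff above):

```java
// The JavaSparkContext exists only on the driver and is not serializable,
// so the task closure below must not capture it.
SerializableConfiguration serializableConf = new SerializableConfiguration(hadoopConf);

jsc.parallelize(partitionPaths, parallelism).foreach(partitionPath -> {
  // Inside the executor task, rebuild a lightweight engine context from the
  // serializable Hadoop configuration instead of shipping one from the driver.
  HoodieEngineContext taskLocalContext = new HoodieMREngineContext(serializableConf.get());
  // ... list files for partitionPath using taskLocalContext ...
});
```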

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieMREngineContext.java
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.engine;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.function.SerializableConsumer;
+import org.apache.hudi.common.function.SerializableFunction;
+import org.apache.hudi.common.function.SerializablePairFunction;
+import org.apache.hudi.common.util.Option;
+
+import org.apache.hudi.common.util.collection.Pair;
+
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static java.util.stream.Collectors.toList;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingFlatMapWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingForeachWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingMapToPairWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingMapWrapper;
+
+/**
+ * A Java-based engine context that can be used from map-reduce tasks executing in query engines like
+ * Spark, Hive and Presto.
+ */
+public final class HoodieMREngineContext extends HoodieEngineContext {

Review comment:
       Yes, but not all commands need the Spark context. It is created only for the commands that truly need it.
   
   Besides this, as discussed offline, there is another issue I encountered when I started fetching `hadoopConf` from `engineContext` based on your other comment. It results in an `NPE` in places where the metadata table is created inside Spark executor tasks, because `engineContext` is not serializable and ends up `null` at the executors. For such scenarios in `HoodieTable`, where we create the metadata table, I will have to add a check that creates a `HoodieMREngineContext` when the engine context is `null`. Thus `hudi-client` needs access to this as well. So, based on what we agreed to, we need something like a `HoodieLocalEngineContext` that can be used in all these places and can sit in `hudi-common`.
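   A minimal sketch of that null check (the method name is hypothetical; `metaClient.getHadoopConf()` is the same accessor used in the diff above):

```java
// engineContext is transient on HoodieTable, so it deserializes to null
// inside Spark executor tasks; fall back to a plain-Java context there.
private HoodieEngineContext getOrCreateEngineContext() {
  if (engineContext == null) {
    engineContext = new HoodieMREngineContext(metaClient.getHadoopConf());
  }
  return engineContext;
}
```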

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -49,12 +60,48 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public List<String> getAllPartitionPaths() throws IOException {
-    FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
     if (assumeDatePartitioning) {
+      FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
       return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath);
-    } else {
-      return FSUtils.getAllFoldersWithPartitionMetaFile(fs, datasetBasePath);
     }
+
+    List<Path> pathsToList = new LinkedList<>();
+    pathsToList.add(new Path(datasetBasePath));
+    List<String> partitionPaths = new ArrayList<>();
+
+    // TODO: Get the parallelism from HoodieWriteConfig
+    final int fileListingParallelism = 1500;

Review comment:
       Will do.

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -64,6 +111,6 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public boolean isInSync() {
-    throw new UnsupportedOperationException();
+    return false;

Review comment:
       Will do.

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -49,12 +60,48 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public List<String> getAllPartitionPaths() throws IOException {
-    FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
     if (assumeDatePartitioning) {
+      FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
       return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath);
-    } else {
-      return FSUtils.getAllFoldersWithPartitionMetaFile(fs, datasetBasePath);
     }
+
+    List<Path> pathsToList = new LinkedList<>();

Review comment:
       As I had mentioned, this is already tested through code paths like the cleaner, and I did verify that `TestCleaner`, for example, fails if this returns incorrect partitions.
   
   Adding specific tests for `FileSystemBackedMetadata` is tricky because the Spark context is not available in `hudi-common`. Perhaps this can be done in a separate PR.
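   For what it's worth, once a Java-based engine context sits in `hudi-common`, a local-filesystem test becomes possible without a Spark context. A hedged sketch (the `FileSystemBackedTableMetadata` constructor arguments below are assumptions, not the PR's exact signature; JUnit imports omitted):

```java
@Test
public void testGetAllPartitionPaths() throws Exception {
  // Lay out two partitions on the local filesystem, each marked with the
  // .hoodie_partition_metadata file that the listing looks for.
  java.nio.file.Path basePath = java.nio.file.Files.createTempDirectory("fs_backed_metadata");
  for (String partition : Arrays.asList("2021/01/01", "2021/01/02")) {
    java.nio.file.Path dir = basePath.resolve(partition);
    java.nio.file.Files.createDirectories(dir);
    java.nio.file.Files.createFile(dir.resolve(".hoodie_partition_metadata"));
  }

  Configuration conf = new Configuration();
  FileSystemBackedTableMetadata metadata = new FileSystemBackedTableMetadata(
      new HoodieMREngineContext(conf), new SerializableConfiguration(conf),
      basePath.toString(), /* assumeDatePartitioning */ false);

  // Both partitions, and nothing else, should be discovered.
  assertEquals(2, metadata.getAllPartitionPaths().size());
}
```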




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

