[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2893: [HUDI-1371] Support metadata based listing for Spark DataSource and Spark SQL

GitBox Sun, 09 May 2021 19:46:13 -0700


pengzhiwei2018 commented on a change in pull request #2893:
URL: https://github.com/apache/hudi/pull/2893#discussion_r629013054




##########
File path: 
hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java
##########
@@ -141,6 +143,31 @@ protected BaseTableMetadata(HoodieEngineContext 
engineContext, HoodieMetadataCon
         .getAllFilesInPartition(partitionPath);
   }
 
+  @Override
+  public Map<String, FileStatus[]> getAllFilesInPartitions(List<String> 
partitionPaths)
+      throws IOException {
+    if (enabled) {
+      Map<String, FileStatus[]> partitionsFilesMap = new HashMap<>();
+
+      try {
+        for (String partitionPath : partitionPaths) {
+          partitionsFilesMap.put(partitionPath, fetchAllFilesInPartition(new 
Path(partitionPath)));
+        }
+      } catch (Exception e) {
+        if (metadataConfig.enableFallback()) {
+          LOG.error("Failed to retrieve files in partitions from metadata", e);

Review comment:
       If enable the fallback here, an empty `partitionsFilesMap` will return 
if there is an Exception happened, is it right?

##########
File path: 
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -105,6 +106,20 @@ public FileSystemBackedTableMetadata(HoodieEngineContext 
engineContext, Serializ
     return partitionPaths;
   }
 
+  @Override
+  public Map<String, FileStatus[]> getAllFilesInPartitions(List<String> 
partitionPaths)
+      throws IOException {
+    int parallelism = Math.min(DEFAULT_LISTING_PARALLELISM, 
partitionPaths.size());

Review comment:
       If the `partitionPaths` is empty, the `parallelism` will be 0, there may 
be an Exception ("Positive number of partitions required")  throw out for the 
`sparkContext.parallelize(seq, parallelism)`,

##########
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -227,7 +227,17 @@ case class HoodieFileIndex(
    */
   private def loadPartitionPathFiles(): Map[PartitionRowPath, 
Array[FileStatus]] = {
     val sparkEngine = new HoodieSparkEngineContext(new 
JavaSparkContext(spark.sparkContext))
+    val serializableConf = new 
SerializableConfiguration(spark.sessionState.newHadoopConf())
+
     val properties = new Properties()
+    // To support metadata listing via Spark SQL we allow users to pass the 
config via Hadoop Conf. Spark SQL does not

Review comment:
       Should we get these `configurations` from the `spark.sessionState.conf` 
for spark? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2893: [HUDI-1371] Support metadata based listing for Spark DataSource and Spark SQL

Reply via email to