umehrot2 commented on a change in pull request #2893:
URL: https://github.com/apache/hudi/pull/2893#discussion_r631414830
##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java
##########
@@ -141,6 +143,31 @@ protected BaseTableMetadata(HoodieEngineContext engineContext, HoodieMetadataCon
.getAllFilesInPartition(partitionPath);
}
+  @Override
+  public Map<String, FileStatus[]> getAllFilesInPartitions(List<String> partitionPaths)
+      throws IOException {
+    if (enabled) {
+      Map<String, FileStatus[]> partitionsFilesMap = new HashMap<>();
+
+      try {
+        for (String partitionPath : partitionPaths) {
+          partitionsFilesMap.put(partitionPath, fetchAllFilesInPartition(new Path(partitionPath)));
+        }
+      } catch (Exception e) {
+        if (metadataConfig.enableFallback()) {
+          LOG.error("Failed to retrieve files in partitions from metadata", e);
Review comment:
Good catch. It wasn't falling back to the file system to do the listing. Will fix it.
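The fallback behavior being discussed could look roughly like the generic pattern below. This is a plain-Java sketch, not the actual Hudi code: `FileStatus[]` is replaced by `String[]`, and the two lister functions are hypothetical stand-ins for the metadata-table lookup and the direct file-system listing.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch: try the metadata table first; on failure, either rethrow
// (fallback disabled) or fall back to direct file-system listing.
public class MetadataFallbackSketch {

  public static Map<String, String[]> getAllFilesInPartitions(
      List<String> partitionPaths,
      boolean metadataEnabled,
      boolean fallbackEnabled,
      Function<String, String[]> metadataLister,
      Function<String, String[]> fileSystemLister) {
    if (metadataEnabled) {
      Map<String, String[]> result = new HashMap<>();
      try {
        for (String partitionPath : partitionPaths) {
          result.put(partitionPath, metadataLister.apply(partitionPath));
        }
        return result;
      } catch (RuntimeException e) {
        if (!fallbackEnabled) {
          throw e; // surface the error when fallback is disabled
        }
        // otherwise fall through to the file-system listing below
      }
    }
    Map<String, String[]> result = new HashMap<>();
    for (String partitionPath : partitionPaths) {
      result.put(partitionPath, fileSystemLister.apply(partitionPath));
    }
    return result;
  }
}
```

The key point of the review is the fall-through: a metadata failure with fallback enabled must still produce a listing rather than an empty or partial map.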
##########
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -227,7 +227,17 @@ case class HoodieFileIndex(
*/
   private def loadPartitionPathFiles(): Map[PartitionRowPath, Array[FileStatus]] = {
     val sparkEngine = new HoodieSparkEngineContext(new JavaSparkContext(spark.sparkContext))
+    val serializableConf = new SerializableConfiguration(spark.sessionState.newHadoopConf())
+
     val properties = new Properties()
+    // To support metadata listing via Spark SQL we allow users to pass the config via Hadoop Conf. Spark SQL does not
Review comment:
Makes sense. It will be a better experience than having to pass it via the
Hadoop conf. Customers would be able to enable it in the Spark SQL session
using `SET` commands.
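The precedence being agreed on here can be sketched as follows. This is an illustrative plain-Java model, not the `HoodieFileIndex` code: the two maps stand in for the Hadoop conf and the session options a user sets via Spark SQL `SET key=value`, and the config key is used only as an example.

```java
import java.util.Map;
import java.util.Properties;

// Sketch: the Hadoop conf supplies a default, and a session-level
// `SET` (modeled as sessionOptions) wins when both are present.
public class ConfigPrecedenceSketch {

  // Example key; any listing-related config would follow the same rule.
  static final String METADATA_ENABLE_KEY = "hoodie.metadata.enable";

  public static Properties buildListingProps(Map<String, String> hadoopConf,
                                             Map<String, String> sessionOptions) {
    Properties props = new Properties();
    // Hadoop conf provides the fallback default ...
    if (hadoopConf.containsKey(METADATA_ENABLE_KEY)) {
      props.setProperty(METADATA_ENABLE_KEY, hadoopConf.get(METADATA_ENABLE_KEY));
    }
    // ... and a session-level setting overrides it if present.
    if (sessionOptions.containsKey(METADATA_ENABLE_KEY)) {
      props.setProperty(METADATA_ENABLE_KEY, sessionOptions.get(METADATA_ENABLE_KEY));
    }
    return props;
  }
}
```

Session-over-conf precedence is what makes the `SET`-command experience work: users can flip the flag per session without editing cluster-wide Hadoop configuration.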
##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -105,6 +106,20 @@ public FileSystemBackedTableMetadata(HoodieEngineContext engineContext, Serializ
return partitionPaths;
}
+  @Override
+  public Map<String, FileStatus[]> getAllFilesInPartitions(List<String> partitionPaths)
+      throws IOException {
+    int parallelism = Math.min(DEFAULT_LISTING_PARALLELISM, partitionPaths.size());
Review comment:
Will add a check to throw a more graceful exception. There should
ideally be at least one partition path. Even in the case of non-partitioned
tables, it expects a path.
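The graceful check mentioned above could look like this minimal sketch. The constant's value is illustrative; the point is to fail fast with a clear message instead of letting `Math.min` produce a parallelism of 0 and failing somewhere deeper in the engine.

```java
import java.util.List;

// Sketch: validate the input before computing listing parallelism.
public class ParallelismCheckSketch {

  static final int DEFAULT_LISTING_PARALLELISM = 1500; // illustrative value

  public static int listingParallelism(List<String> partitionPaths) {
    if (partitionPaths == null || partitionPaths.isEmpty()) {
      // Even a non-partitioned table is expected to supply one path
      // (the table base path), so an empty list is always a caller bug.
      throw new IllegalArgumentException(
          "Partition paths must not be empty; non-partitioned tables should pass the base path");
    }
    return Math.min(DEFAULT_LISTING_PARALLELISM, partitionPaths.size());
  }
}
```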
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]