umehrot2 commented on a change in pull request #2893:
URL: https://github.com/apache/hudi/pull/2893#discussion_r631414830
##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java
##########
@@ -141,6 +143,31 @@ protected BaseTableMetadata(HoodieEngineContext engineContext, HoodieMetadataCon
.getAllFilesInPartition(partitionPath);
}
+  @Override
+  public Map<String, FileStatus[]> getAllFilesInPartitions(List<String> partitionPaths)
+      throws IOException {
+    if (enabled) {
+      Map<String, FileStatus[]> partitionsFilesMap = new HashMap<>();
+
+      try {
+        for (String partitionPath : partitionPaths) {
+          partitionsFilesMap.put(partitionPath, fetchAllFilesInPartition(new Path(partitionPath)));
+        }
+      } catch (Exception e) {
+        if (metadataConfig.enableFallback()) {
+          LOG.error("Failed to retrieve files in partitions from metadata", e);
Review comment:
Good catch. It wasn't falling back to the file system to do the listing. Will fix it.
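The fallback behavior being discussed could look roughly like the generic pattern below. This is a plain-Java sketch, not the actual Hudi code: `FileStatus[]` is replaced by `String[]`, and the two lister functions are hypothetical stand-ins for the metadata-table lookup and the direct file-system listing.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch: try the metadata table first; on failure, either rethrow
// (fallback disabled) or fall back to direct file-system listing.
public class MetadataFallbackSketch {

  public static Map<String, String[]> getAllFilesInPartitions(
      List<String> partitionPaths,
      boolean metadataEnabled,
      boolean fallbackEnabled,
      Function<String, String[]> metadataLister,
      Function<String, String[]> fileSystemLister) {
    if (metadataEnabled) {
      Map<String, String[]> result = new HashMap<>();
      try {
        for (String partitionPath : partitionPaths) {
          result.put(partitionPath, metadataLister.apply(partitionPath));
        }
        return result;
      } catch (RuntimeException e) {
        if (!fallbackEnabled) {
          throw e; // surface the error when fallback is disabled
        }
        // otherwise fall through to the file-system listing below
      }
    }
    Map<String, String[]> result = new HashMap<>();
    for (String partitionPath : partitionPaths) {
      result.put(partitionPath, fileSystemLister.apply(partitionPath));
    }
    return result;
  }
}
```

The key point of the review is the fall-through: a metadata failure with fallback enabled must still produce a listing rather than an empty or partial map.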
##########
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -227,7 +227,17 @@ case class HoodieFileIndex(
*/
   private def loadPartitionPathFiles(): Map[PartitionRowPath, Array[FileStatus]] = {
     val sparkEngine = new HoodieSparkEngineContext(new JavaSparkContext(spark.sparkContext))
+    val serializableConf = new SerializableConfiguration(spark.sessionState.newHadoopConf())
+
     val properties = new Properties()
+    // To support metadata listing via Spark SQL we allow users to pass the config via Hadoop Conf. Spark SQL does not
Review comment:
Makes sense. It will be a better experience than having to pass it via the
Hadoop conf. Customers would be able to enable it in the Spark SQL session
using `SET` commands.
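The precedence being agreed on here can be sketched as follows. This is an illustrative plain-Java model, not the `HoodieFileIndex` code: the two maps stand in for the Hadoop conf and the session options a user sets via Spark SQL `SET key=value`, and the config key is used only as an example.

```java
import java.util.Map;
import java.util.Properties;

// Sketch: the Hadoop conf supplies a default, and a session-level
// `SET` (modeled as sessionOptions) wins when both are present.
public class ConfigPrecedenceSketch {

  // Example key; any listing-related config would follow the same rule.
  static final String METADATA_ENABLE_KEY = "hoodie.metadata.enable";

  public static Properties buildListingProps(Map<String, String> hadoopConf,
                                             Map<String, String> sessionOptions) {
    Properties props = new Properties();
    // Hadoop conf provides the fallback default ...
    if (hadoopConf.containsKey(METADATA_ENABLE_KEY)) {
      props.setProperty(METADATA_ENABLE_KEY, hadoopConf.get(METADATA_ENABLE_KEY));
    }
    // ... and a session-level setting overrides it if present.
    if (sessionOptions.containsKey(METADATA_ENABLE_KEY)) {
      props.setProperty(METADATA_ENABLE_KEY, sessionOptions.get(METADATA_ENABLE_KEY));
    }
    return props;
  }
}
```

Session-over-conf precedence is what makes the `SET`-command experience work: users can flip the flag per session without editing cluster-wide Hadoop configuration.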
##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -105,6 +106,20 @@ public FileSystemBackedTableMetadata(HoodieEngineContext engineContext, Serializ
return partitionPaths;
}
+  @Override
+  public Map<String, FileStatus[]> getAllFilesInPartitions(List<String> partitionPaths)
+      throws IOException {
+    int parallelism = Math.min(DEFAULT_LISTING_PARALLELISM, partitionPaths.size());
Review comment:
Will add a check to throw a more graceful exception. There should
ideally be at least one partition path. Even in the case of non-partitioned
tables, it expects a path.
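The graceful check mentioned above could look like this minimal sketch. The constant's value is illustrative; the point is to fail fast with a clear message instead of letting `Math.min` produce a parallelism of 0 and failing somewhere deeper in the engine.

```java
import java.util.List;

// Sketch: validate the input before computing listing parallelism.
public class ParallelismCheckSketch {

  static final int DEFAULT_LISTING_PARALLELISM = 1500; // illustrative value

  public static int listingParallelism(List<String> partitionPaths) {
    if (partitionPaths == null || partitionPaths.isEmpty()) {
      // Even a non-partitioned table is expected to supply one path
      // (the table base path), so an empty list is always a caller bug.
      throw new IllegalArgumentException(
          "Partition paths must not be empty; non-partitioned tables should pass the base path");
    }
    return Math.min(DEFAULT_LISTING_PARALLELISM, partitionPaths.size());
  }
}
```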
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]