yui2010 commented on a change in pull request #2378:
URL: https://github.com/apache/hudi/pull/2378#discussion_r570113203
##########
File path:
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala
##########
@@ -77,18 +81,26 @@ object HoodieSparkUtils {
* @return list of absolute file paths
*/
def checkAndGlobPathIfNecessary(paths: Seq[String], fs: FileSystem): Seq[Path] = {
+ val globPaths =
paths.flatMap(path => {
val qualified = new Path(path).makeQualified(fs.getUri, fs.getWorkingDirectory)
val globPaths = globPathIfNecessary(fs, qualified)
globPaths
})
+ val filteredGlobPaths = globPaths.filterNot( path =>
TablePathUtils.isHoodieMetaPath(path.toString) || shouldFilterOut(path.getName))
Review comment:
There are two reasons for filtering out the Hoodie meta path:
1. If the load path is a glob such as basePath/\*/\*, it also matches every
.hoodie/*.deltacommit file, which makes Spark issue many unnecessary
fs.listStatus calls. This is inefficient.
2. Loading all the .hoodie/*.deltacommit files causes an exception, because
discoveredBasePaths.distinct.size is 2 when Spark's listFiles is used to prune
partitions.
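
To illustrate the effect of the filter, here is a minimal, self-contained sketch. It works on plain strings rather than Hadoop `Path` objects, and `MetaPathFilterSketch`, `isUnderMetaFolder` and `keepDataPaths` are hypothetical names used only for illustration. The check is an assumed stand-in, not the actual implementation of `TablePathUtils.isHoodieMetaPath`.

```scala
object MetaPathFilterSketch {
  // Name of the metadata folder that sits under a Hudi table's base path.
  val MetaFolderName = ".hoodie"

  // Hypothetical stand-in for TablePathUtils.isHoodieMetaPath (assumption):
  // a path counts as a meta path when one of its components is ".hoodie".
  def isUnderMetaFolder(path: String): Boolean =
    path.split('/').contains(MetaFolderName)

  // Drop every globbed path that lives under the metadata folder, so only
  // data paths are handed on for listing and partition pruning.
  def keepDataPaths(globbed: Seq[String]): Seq[String] =
    globbed.filterNot(isUnderMetaFolder)
}
```

With a glob such as basePath/\*/\*, the expanded list could look like the input below; only the data file remains after filtering:

```scala
val globbed = Seq(
  "hdfs://nn/tbl/2021/part-0001.parquet",
  "hdfs://nn/tbl/.hoodie/20210204.deltacommit",
  "hdfs://nn/tbl/.hoodie/hoodie.properties")

MetaPathFilterSketch.keepDataPaths(globbed)
// Seq("hdfs://nn/tbl/2021/part-0001.parquet")
```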