[ https://issues.apache.org/jira/browse/SPARK-16121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15344664#comment-15344664 ]
Xiangrui Meng commented on SPARK-16121:
---------------------------------------

Changed the fix versions to 2.0.1 and 2.1.0 since 2.0.0-RC1 is in vote.

> ListingFileCatalog does not list in parallel anymore
> ----------------------------------------------------
>
>                 Key: SPARK-16121
>                 URL: https://issues.apache.org/jira/browse/SPARK-16121
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>            Priority: Blocker
>             Fix For: 2.1.0, 2.0.1
>
>
> In ListingFileCatalog, the implementation of {{listLeafFiles}} is shown
> below. When the number of user-provided paths is less than the value of
> {{sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold}}, we
> will not use parallel listing, which is different from what 1.6 does (in
> 1.6, if the number of children of any inner dir is larger than the
> threshold, we will use parallel listing).
> {code}
> protected def listLeafFiles(paths: Seq[Path]): mutable.LinkedHashSet[FileStatus] = {
>   if (paths.length >= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
>     HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sparkSession)
>   } else {
>     // Dummy jobconf to get to the pathFilter defined in configuration
>     val jobConf = new JobConf(hadoopConf, this.getClass)
>     val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
>     val statuses: Seq[FileStatus] = paths.flatMap { path =>
>       val fs = path.getFileSystem(hadoopConf)
>       logInfo(s"Listing $path on driver")
>       Try {
>         HadoopFsRelation.listLeafFiles(fs, fs.getFileStatus(path), pathFilter)
>       }.getOrElse(Array.empty[FileStatus])
>     }
>     mutable.LinkedHashSet(statuses: _*)
>   }
> }
> {code}


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
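To make the behavioral difference concrete, here is a minimal, self-contained sketch of the 1.6-style rule the issue describes: the sequential walk escalates to parallel listing as soon as any inner directory has more children than the threshold, whereas the 2.0 code above only checks the number of top-level input paths. All names here ({{Node}}, {{listLeafFiles}}, the tree model) are hypothetical stand-ins for illustration; the real code path lives in {{HadoopFsRelation}} and uses Hadoop's {{FileSystem}} API.

```scala
// Hypothetical in-memory directory tree, standing in for Hadoop FileStatus.
sealed trait Node { def name: String }
final case class Leaf(name: String) extends Node
final case class Dir(name: String, children: Seq[Node]) extends Node

object LeafListingSketch {
  // Returns the leaf files plus a flag saying whether the walk would have
  // escalated to parallel listing under the 1.6-style rule (any inner dir
  // with more children than the threshold triggers it).
  def listLeafFiles(paths: Seq[Node], threshold: Int): (Seq[Leaf], Boolean) = {
    var escalated = false
    def walk(node: Node): Seq[Leaf] = node match {
      case f: Leaf => Seq(f)
      case Dir(_, children) =>
        if (children.length > threshold) escalated = true // 1.6-style trigger
        children.flatMap(walk)
    }
    val leaves = paths.flatMap(walk)
    (leaves, escalated)
  }

  def main(args: Array[String]): Unit = {
    // One input path whose inner directory is wide: the 1.6 rule would
    // parallelize here, while the 2.0 code above would not, because
    // paths.length == 1 is below the threshold.
    val wide = Dir("root", Seq(Dir("inner", (1 to 5).map(i => Leaf(s"f$i")))))
    val (leaves, escalated) = listLeafFiles(Seq(wide), threshold = 3)
    assert(leaves.length == 5)
    assert(escalated)
    println(s"leaves=${leaves.length}, escalated=$escalated")
  }
}
```

Running the sketch with a single top-level path but a five-child inner directory reports escalation, which is exactly the case SPARK-16121 says the 2.0 {{listLeafFiles}} no longer handles in parallel.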