Yin Huai created SPARK-16121:
--------------------------------

             Summary: ListingFileCatalog does not list in parallel anymore
                 Key: SPARK-16121
                 URL: https://issues.apache.org/jira/browse/SPARK-16121
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Yin Huai
            Priority: Blocker


In {{ListingFileCatalog}}, the implementation of {{listLeafFiles}} is shown below. 
When the number of user-provided paths is less than 
{{sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold}}, we never 
use parallel listing, even if the directories underneath those paths are very 
wide. This is different from 1.6: there, if the number of children of any inner 
directory exceeded the threshold, we switched to parallel listing.
{code}
protected def listLeafFiles(paths: Seq[Path]): mutable.LinkedHashSet[FileStatus] = {
  if (paths.length >= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
    HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sparkSession)
  } else {
    // Dummy jobconf to get to the pathFilter defined in configuration
    val jobConf = new JobConf(hadoopConf, this.getClass)
    val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
    val statuses: Seq[FileStatus] = paths.flatMap { path =>
      val fs = path.getFileSystem(hadoopConf)
      logInfo(s"Listing $path on driver")
      Try {
        HadoopFsRelation.listLeafFiles(fs, fs.getFileStatus(path), pathFilter)
      }.getOrElse(Array.empty[FileStatus])
    }
    mutable.LinkedHashSet(statuses: _*)
  }
}
{code}
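To illustrate the behavioral difference, here is a minimal, self-contained sketch of the two decision rules. The {{Dir}} type, the function names, and the threshold value are all hypothetical, purely for illustration; they are not Spark's actual types or API.

```scala
// Hypothetical directory-tree model, not Spark's real types.
case class Dir(name: String, children: Seq[Dir])

// 2.0 ListingFileCatalog rule: only the count of *top-level* paths is
// compared against the threshold, so one path over a huge tree lists serially.
def usesParallelListing20(topLevelPaths: Seq[Dir], threshold: Int): Boolean =
  topLevelPaths.length >= threshold

// 1.6 rule (as described above): parallel listing kicks in if *any* inner
// directory has more children than the threshold.
def usesParallelListing16(topLevelPaths: Seq[Dir], threshold: Int): Boolean = {
  def anyWide(d: Dir): Boolean =
    d.children.length > threshold || d.children.exists(anyWide)
  topLevelPaths.length >= threshold || topLevelPaths.exists(anyWide)
}

// A single table path with many partition directories underneath it:
val wideTable = Dir("table", (1 to 100).map(i => Dir(s"part=$i", Nil)))

println(usesParallelListing20(Seq(wideTable), 32)) // false: only 1 top-level path
println(usesParallelListing16(Seq(wideTable), 32)) // true: an inner dir is wide
```

Under the sketched 2.0 rule a single wide table is always listed serially on the driver, which is the regression this issue reports.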



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
