Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/11270#discussion_r53448453
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ResolvedDataSource.scala ---
@@ -130,7 +141,49 @@ object ResolvedDataSource extends Logging {
bucketSpec: Option[BucketSpec],
provider: String,
options: Map[String, String]): ResolvedDataSource = {
-    val clazz: Class[_] = lookupDataSource(provider)
+    // Here, it tries to find out the data source by file extension if `format()` is not called.
+    // The auto-detection is based on the given paths and recognizes glob patterns as well, but
+    // it does not recursively check sub-paths even if the given paths are directories.
+    // The source detection goes through the following steps:
+    //
+    // 1. Check `provider` and use it if it is not `null`.
+    // 2. If `provider` is not given, try to detect the source type by extension;
+    //    at this point, it detects one only if all the given paths have the same extension.
+    // 3. If the detection fails, use the data source given by `spark.sql.sources.default`.
+    //
+    val paths = {
+      val caseInsensitiveOptions = new CaseInsensitiveMap(options)
+      if (caseInsensitiveOptions.contains("paths") &&
+          caseInsensitiveOptions.contains("path")) {
+        throw new AnalysisException(s"Both path and paths options are present.")
+      }
+      caseInsensitiveOptions.get("paths")
+        .map(_.split("(?<!\\\\),").map(StringUtils.unEscapeString(_, '\\', ',')))
+        .getOrElse(Array(caseInsensitiveOptions("path")))
+        .flatMap { pathString =>
+          val hdfsPath = new Path(pathString)
+          val fs = hdfsPath.getFileSystem(sqlContext.sparkContext.hadoopConfiguration)
+          val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
+          SparkHadoopUtil.get.globPathIfNecessary(qualified).map(_.toString)
+        }
+    }
+    val safeProvider = Option(provider).getOrElse {
+      val safePaths = paths.filterNot { path =>
+        val fileName = FilenameUtils.getBaseName(path)
+        fileName.startsWith("_") || fileName.startsWith(".")
+      }
+      val extensions = safePaths.map { path =>
+        FilenameUtils.getExtension(path).toLowerCase
+      }
+      val defaultDataSourceName = sqlContext.conf.defaultDataSourceName
+      if (extensions.exists(extensions.head != _)) {
+        defaultDataSourceName
--- End diff --
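To make the detection order above concrete, here is a minimal, self-contained sketch of the same three steps. The names (`SourceDetectionSketch`, `detectProvider`) and the hard-coded `parquet` default are illustrative only; the real resolution lives in `ResolvedDataSource` / `lookupDataSource` and reads `spark.sql.sources.default` from the conf.

```scala
// Sketch of the three-step detection described in the diff above (illustrative names).
object SourceDetectionSketch {
  // Stand-in for sqlContext.conf.defaultDataSourceName (spark.sql.sources.default).
  private val defaultSource = "parquet"

  def detectProvider(provider: Option[String], paths: Seq[String]): String = {
    // Step 1: use the explicitly given provider if there is one.
    provider.getOrElse {
      // Step 2: ignore hidden/metadata files such as _SUCCESS or .crc files,
      // then collect the distinct, lower-cased extensions of the remaining paths.
      val fileNames = paths.map(_.split('/').last)
        .filterNot(name => name.startsWith("_") || name.startsWith("."))
      val extensions = fileNames
        .map(name => name.split('.').drop(1).lastOption.getOrElse("").toLowerCase)
        .distinct
      extensions match {
        // All paths share one non-empty extension: treat it as the source name.
        case Seq(ext) if ext.nonEmpty => ext
        // Step 3: mixed or missing extensions fall back to the default source.
        case _ => defaultSource
      }
    }
  }
}

// e.g. detectProvider(None, Seq("/data/a.json", "/data/b.json"))  // "json"
//      detectProvider(None, Seq("/data/a.json", "/data/b.csv"))   // "parquet"
//      detectProvider(Some("orc"), Seq("/data/a.json"))           // "orc"
```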
Then the original call `read().load()` for files having no extension would throw an exception, which breaks backward compatibility.
If we drop this for Spark 2.0, I think that is a good idea.
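For context, the compatibility case in question looks roughly like this (hypothetical path, assuming the stock value of `spark.sql.sources.default`, which is parquet):

```scala
// No format() call and no file extension: today this resolves to the default
// data source (parquet by default) and succeeds. If extension detection threw
// an exception instead of falling back, this existing call would start failing.
val df = sqlContext.read.load("/data/output/part-00000")
```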