Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/11270#discussion_r53447740
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ResolvedDataSource.scala
---
@@ -130,7 +141,49 @@ object ResolvedDataSource extends Logging {
bucketSpec: Option[BucketSpec],
provider: String,
options: Map[String, String]): ResolvedDataSource = {
    -    val clazz: Class[_] = lookupDataSource(provider)
    +    // Here, it tries to find the data source by file extension if `format()` is not called.
    +    // The auto-detection is based on the given paths and recognizes glob patterns as well,
    +    // but it does not recursively check sub-paths even if the given paths are directories.
    +    // The source detection goes through the following steps:
    +    //
    +    // 1. Check `provider` and use it if it is not `null`.
    +    // 2. If `provider` is not given, try to detect the source type by extension;
    +    //    at this point, it detects a source only if all the given paths have the
    +    //    same extension.
    +    // 3. If detection fails, use the data source given by `spark.sql.sources.default`.
    +    //
    +    val paths = {
    +      val caseInsensitiveOptions = new CaseInsensitiveMap(options)
    +      if (caseInsensitiveOptions.contains("paths") &&
    +          caseInsensitiveOptions.contains("path")) {
    +        throw new AnalysisException(s"Both path and paths options are present.")
    +      }
    +      caseInsensitiveOptions.get("paths")
    +        .map(_.split("(?<!\\\\),").map(StringUtils.unEscapeString(_, '\\', ',')))
    +        .getOrElse(Array(caseInsensitiveOptions("path")))
    +        .flatMap { pathString =>
    +          val hdfsPath = new Path(pathString)
    +          val fs = hdfsPath.getFileSystem(sqlContext.sparkContext.hadoopConfiguration)
    +          val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
    +          SparkHadoopUtil.get.globPathIfNecessary(qualified).map(_.toString)
    +        }
    +    }
    +    val safeProvider = Option(provider).getOrElse {
    +      val safePaths = paths.filterNot { path =>
    +        // Skip metadata/hidden files such as `_SUCCESS` or `.crc`.
    +        val baseName = FilenameUtils.getBaseName(path)
    +        baseName.startsWith("_") || baseName.startsWith(".")
    +      }
    +      val extensions = safePaths.map { path =>
    +        FilenameUtils.getExtension(path).toLowerCase
    +      }
    +      val defaultDataSourceName = sqlContext.conf.defaultDataSourceName
    +      if (extensions.exists(extensions.head != _)) {
    +        defaultDataSourceName
--- End diff --
An alternative idea is to throw an exception as soon as possible, so that users can
easily understand the cause of the error.
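The detection flow above, together with this fail-fast alternative, could be sketched roughly as follows. Note this is a standalone illustration, not the PR's actual code: `DetectionSketch`, `detectProvider`, the `failFast` flag, and the `"parquet"` default are all hypothetical stand-ins (the real default comes from `spark.sql.sources.default`, and the PR uses `FilenameUtils` rather than manual string slicing).

```scala
// Sketch of extension-based source detection with an optional fail-fast mode.
object DetectionSketch {
  // Stand-in for the value of `spark.sql.sources.default`.
  private val defaultSource = "parquet"

  // Returns the detected source name when all visible paths share one
  // extension; otherwise falls back to the default, or throws if
  // `failFast` is set (the behavior suggested in the review comment).
  def detectProvider(paths: Seq[String], failFast: Boolean = false): String = {
    // Step 2a: skip metadata/hidden files such as `_SUCCESS` or `.crc`.
    val visible = paths.filterNot { p =>
      val base = p.substring(p.lastIndexOf('/') + 1)
      base.startsWith("_") || base.startsWith(".")
    }
    // Step 2b: collect the distinct lower-cased extensions.
    val exts = visible.map { p =>
      val base = p.substring(p.lastIndexOf('/') + 1)
      val dot = base.lastIndexOf('.')
      if (dot >= 0) base.substring(dot + 1).toLowerCase else ""
    }.distinct
    exts match {
      // Exactly one non-empty extension: use it as the source type.
      case Seq(single) if single.nonEmpty => single
      // Mixed or missing extensions: throw early, or fall back (step 3).
      case _ if failFast =>
        throw new IllegalArgumentException(
          s"Cannot infer a data source: mixed or missing extensions in $paths")
      case _ => defaultSource
    }
  }
}
```

With fail-fast enabled, a mixed-extension input such as `Seq("/data/a.json", "/data/b.csv")` surfaces an explicit error instead of silently falling back to the default source.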