[
https://issues.apache.org/jira/browse/SPARK-26339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16717961#comment-16717961
]
ASF GitHub Bot commented on SPARK-26339:
----------------------------------------
srowen commented on a change in pull request #23288: [SPARK-26339][SQL]Throws
better exception when reading files that start with underscore
URL: https://github.com/apache/spark/pull/23288#discussion_r240781003
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
##########
@@ -554,7 +554,8 @@ case class DataSource(
// Sufficient to check head of the globPath seq for non-glob scenario
// Don't need to check once again if files exist in streaming mode
- if (checkFilesExist && !fs.exists(globPath.head)) {
+ if (checkFilesExist &&
+ (!fs.exists(globPath.head) || InMemoryFileIndex.shouldFilterOut(globPath.head.getName))) {
Review comment:
I'm probably misunderstanding, but doesn't this still cause it to throw a
'Path does not exist' exception?
----------------------------------------------------------------
> Behavior of reading files that start with underscore is confusing
> -----------------------------------------------------------------
>
> Key: SPARK-26339
> URL: https://issues.apache.org/jira/browse/SPARK-26339
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Keiichi Hirobe
> Priority: Minor
>
> Reading files whose names start with an underscore behaves as follows:
> # spark.read (no schema) throws an exception whose message is confusing.
> # spark.read (with a user-specified schema) succeeds, but the result is
> empty.
> Example files are shown below.
> The same behavior occurred when I read JSON files.
> {code:bash}
> $ cat test.csv
> test1,10
> test2,20
> $ cp test.csv _test.csv
> $ ./bin/spark-shell --master local[2]
> {code}
> Spark shell snippet for reproduction:
> {code:java}
> scala> val df=spark.read.csv("test.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string]
> scala> df.show()
> +-----+---+
> | _c0|_c1|
> +-----+---+
> |test1| 10|
> |test2| 20|
> +-----+---+
> scala> val df = spark.read.schema("test STRING, number INT").csv("test.csv")
> df: org.apache.spark.sql.DataFrame = [test: string, number: int]
> scala> df.show()
> +-----+------+
> | test|number|
> +-----+------+
> |test1|    10|
> |test2|    20|
> +-----+------+
> scala> val df=spark.read.csv("_test.csv")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.;
> at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$13(DataSource.scala:185)
> at scala.Option.getOrElse(Option.scala:138)
> at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:185)
> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
> at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:231)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:219)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:625)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:478)
> ... 49 elided
> scala> val df=spark.read.schema("test STRING, number INT").csv("_test.csv")
> df: org.apache.spark.sql.DataFrame = [test: string, number: int]
> scala> df.show()
> +----+------+
> |test|number|
> +----+------+
> +----+------+
> {code}
> After reading some of the code, I noticed that Spark cannot read files whose
> names start with an underscore. (I could not find any documentation about
> this file name restriction.)
> I think the behavior above is undesirable, especially in the user-specified
> schema case.
> I suggest throwing an exception with the message "Path does not exist" in
> both cases.
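
For context, the skipping described in the report appears to come from a hidden-file rule applied while listing input files. The sketch below is a simplified, standalone approximation of that rule in Scala, modeled on the `InMemoryFileIndex.shouldFilterOut` call referenced in the diff above. The exact conditions in Spark may differ between versions, and `HiddenPathSketch` and `checkPath` are hypothetical names introduced only for illustration; this is not the code from the pull request.

```scala
// Simplified sketch of a hidden-file rule; an approximation, not Spark's source.
object HiddenPathSketch {

  /** Returns true when a file name would be skipped while listing input files. */
  def shouldFilterOut(pathName: String): Boolean = {
    // Assumed rule: names starting with "_" (unless they look like a
    // partition directory such as "_col=1") or "." are treated as hidden,
    // as are in-progress copies ending in "._COPYING_".
    val exclude = (pathName.startsWith("_") && !pathName.contains("=")) ||
      pathName.startsWith(".") || pathName.endsWith("._COPYING_")
    // Assumed exception: Parquet summary files are still listed.
    val include = pathName.startsWith("_common_metadata") ||
      pathName.startsWith("_metadata")
    exclude && !include
  }

  /** Hypothetical check mirroring the proposed condition: a path that is
    * missing or would be filtered out is reported as "Path does not exist". */
  def checkPath(name: String, exists: Boolean): Either[String, String] =
    if (!exists || shouldFilterOut(name)) Left(s"Path does not exist: $name")
    else Right(name)
}
```

Under this sketch, `_test.csv` is filtered out while `test.csv` is not, which is consistent with the empty result shown in the reproduction.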