GitHub user skambha opened a pull request:
https://github.com/apache/spark/pull/11775
[SPARK-13774][SQL] - Improve error message for non-existent paths and add
tests
SPARK-13774: IllegalArgumentException: Can not create a Path from an empty
string for incorrect file path
**Overview:**
- If a non-existent path is given in this call
```
scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv")
```
it throws the following error:
`java.lang.IllegalArgumentException: Can not create a Path from an empty
string` ...
The exception is raised from the inferSchema call in
`org.apache.spark.sql.execution.datasources.DataSource.resolveRelation`.
- The purpose of this JIRA is to throw a better error message.
- With the fix, you will now get a _Path does not exist_ error message.
```
scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv")
org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/ksunitha/trunk/spark/file-path-is-incorrect.csv;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:215)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:204)
...
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:204)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:131)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:141)
... 49 elided
```
**Details**
_Changes include:_
- Check whether the path exists in resolveRelation in DataSource, and throw
an AnalysisException with a message like "Path does not exist: $path".
- An AnalysisException is thrown, consistent with the other exceptions thrown in
resolveRelation.
- Glob and non-glob paths are both checked with a minimal number of calls to
pathExists. If globPath is empty, the glob pattern matched nothing and
an error is thrown. If the path is not a glob, only the first element
of the Seq needs to be checked.
- A new method pathExists is added to SparkHadoopUtil to check if path
exists or not.
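
The checking logic described in this list can be sketched roughly as follows. This is an illustration only, not the code in the patch: it uses `java.nio.file` instead of the Hadoop FileSystem so that it runs without Spark on the classpath, and `PathNotFound`, `PathCheckSketch`, `checkPaths` and its parameters are hypothetical names introduced for the example.

```scala
import java.nio.file.{Files, Paths}

// Hypothetical stand-in for org.apache.spark.sql.AnalysisException, so the
// sketch runs without Spark on the classpath.
final case class PathNotFound(message: String) extends Exception(message)

object PathCheckSketch {
  // Stand-in for the new pathExists helper; this version only looks at the
  // local file system.
  def pathExists(path: String): Boolean = Files.exists(Paths.get(path))

  // `resolved` is assumed to be the already-expanded path list: the glob
  // matches for a glob pattern, or the path itself for a non-glob path.
  def checkPaths(original: String, isGlob: Boolean, resolved: Seq[String]): Seq[String] = {
    val missing =
      if (isGlob) resolved.isEmpty // an empty expansion means nothing matched
      else resolved.headOption.forall(p => !pathExists(p)) // one existence check
    if (missing) throw PathNotFound(s"Path does not exist: $original")
    resolved
  }
}
```

In this sketch a glob pattern costs no extra existence check, since an empty expansion already says that nothing matched, and a non-glob path costs exactly one.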
_Test modifications:_
- 3 existing tests were changed to account for this error checking.
- SQLQuerySuite: test("run sql directly on files") - the expected error message
needed to be updated.
- 2 tests failed in MetastoreDataSourcesSuite because they used a dummy
path, so they are modified to use a temp dir, which lets them get past the new
check and continue to test the code path they were meant to test.
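
As a rough illustration of the temp dir substitution, a test can create a real directory, run its body against it, and clean up afterwards. The sketch below uses only JDK calls; `TempDirSketch` and `withTempDir` are hypothetical names for this example and are not taken from the patch.

```scala
import java.io.File
import java.nio.file.Files

object TempDirSketch {
  // Hypothetical helper: creates a real directory, hands it to the test body,
  // and removes it afterwards even if the body throws.
  def withTempDir[A](body: File => A): A = {
    val dir = Files.createTempDirectory("spark-test-").toFile
    try body(dir)
    finally deleteRecursively(dir)
  }

  private def deleteRecursively(f: File): Unit = {
    // listFiles returns null for non-directories, so wrap it in Option.
    Option(f.listFiles).foreach(_.foreach(deleteRecursively))
    f.delete()
  }
}
```

A test that previously passed a made-up path string would instead pass `dir.getCanonicalPath` from inside the `withTempDir` body, so the existence check succeeds and the rest of the test runs as before.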
_New Tests:_
2 new tests are added to DataFrameSuite to validate that both glob and non-glob
paths throw the new error message.
_Testing:_
Unit tests were run with the fix.
**Notes/Questions to reviewers:**
- There is some code duplication in DataSource.scala between the resolveRelation
method and createSource with respect to getting the paths. I have not
made any changes to the createSource codepath. Should we make the change there
as well?
- From other JIRAs, I know there is restructuring going on in
this area, and I am not sure how that will affect these changes, but since this
seemed like a starter issue, I looked into it. If we prefer not to add the
overhead of the checks, or if there is a better place to do so, let me know.
I would appreciate your review. Thanks for your time and comments.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/skambha/spark improve_errmsg
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11775.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11775
----
commit e03e7761679502a98d11e3c254cc62efcc0eb36b
Author: Sunitha Kambhampati <[email protected]>
Date: 2016-03-16T21:44:03Z
SPARK-13774 - Improve error message for non-existent paths and add tests
----