GitHub user skambha opened a pull request: https://github.com/apache/spark/pull/11775
[SPARK-13774][SQL] - Improve error message for non-existent paths and add tests

SPARK-13774: IllegalArgumentException: Can not create a Path from an empty string for incorrect file path

**Overview:**
- If a non-existent path is given in this call
  `` scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv") ``
  it throws the following error:
  `java.lang.IllegalArgumentException: Can not create a Path from an empty string` ...
  The exception is raised from the inferSchema call in `org.apache.spark.sql.execution.datasources.DataSource.resolveRelation`.
- The purpose of this JIRA is to throw a better error message.
- With the fix, you will now get a _Path does not exist_ error message.

```
scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv")
org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/ksunitha/trunk/spark/file-path-is-incorrect.csv;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:215)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:204)
  ...
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:204)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:131)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:141)
  ... 49 elided
```

**Details**

_Changes include:_
- Check whether the path exists in resolveRelation in DataSource, and throw an AnalysisException with a message like "Path does not exist: $path".
- The AnalysisException is thrown in the same way as the other exceptions thrown in resolveRelation.
- The glob path and the non-glob path are checked with minimal calls to pathExists. If the globPath is empty, then it is a nonexistent glob pattern and an error will be thrown. If it is not a glob path, it is only necessary to check whether the first element in the Seq is valid.
- A new method pathExists is added to SparkHadoopUtil to check whether a path exists.

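
The check described under _Changes include_ can be sketched roughly as follows. This is only an illustration of the logic as described, not the code in this pull request: it uses `java.nio.file` instead of the Hadoop `FileSystem` API so that it runs without Spark or Hadoop on the classpath. `PathNotFoundException`, `PathCheckSketch`, `globIfNecessary`, and `resolvePaths` are hypothetical names, the glob handling is simplified to patterns in the last path segment, and the message omits the qualified `file:` URI shown in the output above.

```scala
import java.nio.file.{Files, Path, Paths}

// Hypothetical stand-in for Spark's AnalysisException (assumption, not the real class).
final class PathNotFoundException(message: String) extends Exception(message)

object PathCheckSketch {
  // Simplified assumption: a path is treated as a glob if it contains one of these characters.
  private val GlobChars: Set[Char] = Set('*', '?', '[', ']', '{', '}')

  def isGlob(pattern: String): Boolean = pattern.exists(GlobChars)

  // Expand a glob whose wildcard appears only in the last path segment (non-recursive).
  private def expandGlob(pattern: String): Seq[Path] = {
    val idx = pattern.lastIndexOf('/')
    val (dir, filePattern) =
      if (idx >= 0) (Paths.get(pattern.substring(0, idx + 1)), pattern.substring(idx + 1))
      else (Paths.get("."), pattern)
    if (!Files.isDirectory(dir)) Nil
    else {
      val stream = Files.newDirectoryStream(dir, filePattern)
      try {
        val matches = List.newBuilder[Path]
        val it = stream.iterator()
        while (it.hasNext) matches += it.next()
        matches.result()
      } finally stream.close()
    }
  }

  // Returns the glob matches for a glob pattern, or a one-element Seq for a plain path.
  def globIfNecessary(raw: String): Seq[Path] =
    if (isGlob(raw)) expandGlob(raw) else Seq(Paths.get(raw))

  // Fails with a "Path does not exist" message instead of a less helpful downstream error.
  def resolvePaths(paths: Seq[String]): Seq[Path] = paths.flatMap { raw =>
    val expanded = globIfNecessary(raw)
    // An empty expansion means the glob pattern matched nothing.
    if (expanded.isEmpty) throw new PathNotFoundException(s"Path does not exist: $raw")
    // Glob matches already exist, so checking the head covers the non-glob case
    // (a single, unverified element) with one existence call.
    if (!Files.exists(expanded.head)) throw new PathNotFoundException(s"Path does not exist: $raw")
    expanded
  }
}
```

Calling `PathCheckSketch.resolvePaths(Seq("file-path-is-incorrect.csv"))` in a directory without that file would throw the hypothetical exception with the message `Path does not exist: file-path-is-incorrect.csv`.
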
_Test modifications:_
- Changes went into 3 tests to account for this error checking.
- SQLQuerySuite: test("run sql directly on files"): the error message needed to be updated.
- 2 tests failed in MetastoreDataSourcesSuite because they used a dummy path, so the tests are modified to pass a temp dir, which lets them get past the new check and continue to test the codepath they were meant to test.

_New Tests:_
2 new tests are added to DataFrameSuite to validate that glob and non-glob paths throw the new error message.

_Testing:_
Unit tests were run with the fix.

**Notes/Questions to reviewers:**
- There is some code duplication in DataSource.scala between the resolveRelation method and createSource with respect to getting the paths. I have not made any changes to the createSource codepath. Should we make the change there as well?
- From other JIRAs, I know there is restructuring and other changes going on in this area. I am not sure how that will affect these changes, but since this seemed like a starter issue, I looked into it. If we prefer not to add the overhead of the checks, or if there is a better place to do so, let me know.

I would appreciate your review. Thanks for your time and comments.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/skambha/spark improve_errmsg

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11775.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11775

----
commit e03e7761679502a98d11e3c254cc62efcc0eb36b
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2016-03-16T21:44:03Z

    SPARK-13774 - Improve error message for non-existent paths and add tests

----