GitHub user skambha opened a pull request:

    https://github.com/apache/spark/pull/11775

    [SPARK-13774][SQL] - Improve error message for non-existent paths and add 
tests

    SPARK-13774: IllegalArgumentException: Can not create a Path from an empty 
string for incorrect file path
    
    **Overview:**
    -   If a non-existent path is given in this call
    ```
    scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv")
    ```
    it throws the following error:
    `java.lang.IllegalArgumentException: Can not create a Path from an empty 
string` ….. 
    It is raised from the inferSchema call in 
`org.apache.spark.sql.execution.datasources.DataSource.resolveRelation`.
    
    -   The purpose of this JIRA is to throw a better error message. 
    -   With the fix, you will now get a _Path does not exist_ error message. 
    ```
    scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv")
    org.apache.spark.sql.AnalysisException: Path does not exist: 
file:/Users/ksunitha/trunk/spark/file-path-is-incorrect.csv;
      at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:215)
      at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:204)
      ...
      at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:204)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:131)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:141)
      ... 49 elided
    ```
    
    **Details**
    _Changes include:_
    -   Check whether the path exists in resolveRelation in DataSource, and 
throw an AnalysisException with a message like “Path does not exist: $path”.
    -   An AnalysisException is thrown, similar to the other exceptions thrown 
in resolveRelation.
    -   The glob and non-glob paths are checked with minimal calls to 
pathExists. If the globPath is empty, then it is a nonexistent glob pattern and 
an error will be thrown. If it is not a glob path, it is only necessary to 
check whether the first element in the Seq is valid. 
    -   A new method pathExists is added to SparkHadoopUtil to check whether a 
path exists.
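    
    The check described in the list above can be sketched roughly as follows. 
This is not the code from the patch: it uses java.nio.file as a stand-in for 
the Hadoop FileSystem API so that it runs without Spark, and 
PathNotFoundException, PathCheckSketch, expandGlob and resolvePaths are 
hypothetical names chosen for illustration. The glob expansion is simplified 
and only handles wildcards in the last path component.
    
    ```scala
    import java.nio.file.{Files, Path, Paths}

    // Hypothetical stand-in for the AnalysisException mentioned in this PR.
    final class PathNotFoundException(message: String) extends Exception(message)

    object PathCheckSketch {
      // Characters that mark a path as a glob pattern (assumed set).
      private val globChars: Set[Char] = "{}[]*?\\".toSet

      def isGlobPath(pattern: String): Boolean = pattern.exists(globChars)

      // Simplified glob expansion: wildcards are honoured only in the last
      // path component; a missing parent directory yields no matches.
      def expandGlob(pattern: String): Seq[Path] = {
        val slash = pattern.lastIndexOf('/')
        val (dir, glob) =
          if (slash < 0) (".", pattern)
          else (pattern.substring(0, slash), pattern.substring(slash + 1))
        val dirPath = Paths.get(if (dir.isEmpty) "/" else dir)
        if (!Files.isDirectory(dirPath)) {
          Seq.empty
        } else {
          val stream = Files.newDirectoryStream(dirPath, glob)
          try {
            // Materialise the matches before the stream is closed.
            val matches = List.newBuilder[Path]
            val it = stream.iterator()
            while (it.hasNext) matches += it.next()
            matches.result()
          } finally stream.close()
        }
      }

      // Glob paths expand to their matches; non-glob paths become a one-element Seq.
      def globPathIfNecessary(path: String): Seq[Path] =
        if (isGlobPath(path)) expandGlob(path) else Seq(Paths.get(path))

      def resolvePaths(paths: Seq[String]): Seq[Path] =
        paths.flatMap { path =>
          val globbed = globPathIfNecessary(path)
          // An empty result means a glob pattern that matched nothing.
          if (globbed.isEmpty) {
            throw new PathNotFoundException(s"Path does not exist: $path")
          }
          // Glob matches already exist, so checking the head covers the non-glob case.
          if (!Files.exists(globbed.head)) {
            throw new PathNotFoundException(s"Path does not exist: ${globbed.head}")
          }
          globbed
        }
    }
    ```
    
    In this sketch, each input path costs at most one existence check, which 
is the "minimal calls" idea from the list above.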
    
    _Test modifications:_
    -   3 tests were changed to account for this error checking.
    -   SQLQuerySuite:test("run sql directly on files") – The error message 
needed to be updated.
    -   2 tests failed in MetastoreDataSourcesSuite because they used a dummy 
path, so they are modified to use a temp dir, which lets them get past the new 
check and continue to test the codepath they were meant to test.
    
    _New Tests:_
    2 new tests are added to DataFrameSuite to validate that both glob and 
non-glob paths throw the new error message.  
    
    _Testing:_
    Unit tests were run with the fix.
    
    **Notes/Questions to reviewers:**
    -   There is some code duplication in DataSource.scala between the 
resolveRelation method and createSource with respect to getting the paths.  I 
have not made any changes to the createSource codepath.  Should we make the 
change there as well? 
    
    -   From other JIRAs, I know there is restructuring and other changes going 
on in this area, and I am not sure how that will affect these changes. Since 
this seemed like a starter issue, I looked into it.  If we prefer not to add the 
overhead of the checks, or if there is a better place to do so, let me know.  
    
    I would appreciate your review. Thanks for your time and comments.
        
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/skambha/spark improve_errmsg

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11775.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11775
    
----
commit e03e7761679502a98d11e3c254cc62efcc0eb36b
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2016-03-16T21:44:03Z

    SPARK-13774 - Improve error message for non-existent paths and add tests

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
