GitHub user sameeragarwal opened a pull request:

    https://github.com/apache/spark/pull/17847

    [SPARK-20590] Map default input data source formats to inlined classes

    ## What changes were proposed in this pull request?
    
    One of the common usability problems when reading data in Spark 
(particularly CSV) is that multiple readers for the same format can conflict 
on the classpath.
    
    As an example, if someone launches a Spark 2.x shell with the spark-csv 
package on the classpath, Spark currently fails in an extremely unfriendly way 
(see https://github.com/databricks/spark-csv/issues/367):
    
    ```scala
    ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
    scala> val df = spark.read.csv("/foo/bar.csv")
    java.lang.RuntimeException: Multiple sources found for csv 
(org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, 
com.databricks.spark.csv.DefaultSource15), please specify the fully qualified 
class name.
      at scala.sys.package$.error(package.scala:27)
      at 
org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574)
      at 
org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85)
      at 
org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85)
      at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
      at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
      at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
      ... 48 elided
    ```
    
    This patch proposes a simple fix for this error: always map the default 
input data source format names to the inlined classes that ship with Spark:
    
    ```scala
    ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
    scala> val df = spark.read.csv("/foo/bar.csv")
    df: org.apache.spark.sql.DataFrame = [_c0: string]
    ```
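    The idea can be sketched as a lookup table that prefers Spark's built-in 
implementations for known short names. This is a minimal sketch, not the 
actual Spark code; the `CSVFileFormat` class name is taken from the stack 
trace above, while the other entries and the `resolve` helper are illustrative 
assumptions:
    
    ```scala
    // Hypothetical sketch of mapping short format names to Spark's
    // built-in (inlined) data source classes, so that external packages
    // on the classpath cannot shadow them.
    object BuiltInSourceMap {
      // Short name -> fully qualified built-in class (illustrative entries).
      private val builtInSources: Map[String, String] = Map(
        "csv"  -> "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat",
        "json" -> "org.apache.spark.sql.execution.datasources.json.JsonFileFormat",
        "parquet" -> "org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat"
      )
    
      // Resolve a user-supplied provider name: known short names always map
      // to the built-in class; anything else is passed through unchanged.
      def resolve(provider: String): String =
        builtInSources.getOrElse(provider.toLowerCase, provider)
    }
    ```
    
    With such a mapping, `spark.read.csv(...)` resolves deterministically to 
the built-in reader even when a third-party package registers the same short 
name, while fully qualified class names still select external sources 
explicitly.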
    
    ## How was this patch tested?
    
    Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sameeragarwal/spark csv-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17847.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17847
    
----
commit 1af4675707a6fe1a1acaff0a30e8ef6c2ed5ff46
Author: Sameer Agarwal <[email protected]>
Date:   2017-05-03T20:57:16Z

    map short names to correct class

----


