GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/14339

    [SPARK-16698][SQL] Field names having dots should be allowed for datasources based on FileFormat

    ## What changes were proposed in this pull request?
    
    It seems this is a regression, judging from https://issues.apache.org/jira/browse/SPARK-16698.
    
    A field name containing dots throws an exception. For example, the code below:
    
    ```scala
    val path = "/tmp/path"
    val json = """ {"a.b":"data"}"""
    spark.sparkContext
      .parallelize(json :: Nil)
      .saveAsTextFile(path)
    spark.read.json(path).collect()
    ```
    
    throws an exception as below:
    
    ```
    Unable to resolve a.b given [a.b];
    org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b];
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
        at scala.Option.getOrElse(Option.scala:121)
    ```
    
    This problem was introduced in https://github.com/apache/spark/commit/17eec0a71ba8713c559d641e3f43a1be726b037c#diff-27c76f96a7b2733ecfd6f46a1716e153R121
    
    When extracting the data columns, it does not take into account that
    field names can contain dots. Also, field names are not expected to be
    quoted (wrapped in backticks) when a schema is defined, so there is no
    need to check whether a name is quoted: the actual schema (inferred or
    user-given) would not contain the backticks.
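
    To illustrate the direction of a fix (a minimal sketch, not the actual
    Spark change; `resolveDataColumns` is a hypothetical helper and exact,
    case-sensitive matching is an assumption for illustration), the data
    columns can be looked up by comparing field names literally instead of
    letting the dot-splitting resolution path break "a.b" into nested parts:

    ```scala
    import org.apache.spark.sql.catalyst.expressions.Attribute
    import org.apache.spark.sql.types.StructType

    // Hypothetical sketch: match each field of the data schema against the
    // relation's output attributes by plain name equality, so a field
    // literally named "a.b" is never split into a lookup of "b" inside "a".
    def resolveDataColumns(
        dataSchema: StructType,
        output: Seq[Attribute]): Seq[Attribute] = {
      dataSchema.map { field =>
        output.find(_.name == field.name).getOrElse {
          throw new IllegalArgumentException(s"Unable to resolve ${field.name}")
        }
      }
    }
    ```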
    
    For example, the code below throws an exception (**loading JSON from an RDD is otherwise fine**):
    
    ```scala
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val json = """ {"a.b":"data"}"""
    val rdd = spark.sparkContext.parallelize(json :: Nil)
    spark.read.schema(StructType(Seq(StructField("`a.b`", StringType, true))))
      .json(rdd).select("`a.b`").printSchema()
    ```
    
    The error is:
    
    ```
    cannot resolve '```a.b```' given input columns: [`a.b`];
    org.apache.spark.sql.AnalysisException: cannot resolve '```a.b```' given input columns: [`a.b`];
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    ```
    
    ## How was this patch tested?
    
    Unit tests in `FileSourceStrategySuite`.
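
    As a rough sketch of the scenario those tests cover (this is not the
    actual suite; it assumes a running `spark` session like the snippets
    above, and simplifies the temporary-path handling):

    ```scala
    import java.nio.file.Files

    // Write JSON whose field name contains a dot, read it back through the
    // file-based JSON datasource, and check that the column resolves.
    val path = Files.createTempDirectory("spark-16698").resolve("json").toString

    spark.sparkContext
      .parallelize("""{"a.b":"data"}""" :: Nil)
      .saveAsTextFile(path)

    // Before the fix, this threw: "Unable to resolve a.b given [a.b]"
    val rows = spark.read.json(path).select("`a.b`").collect()
    assert(rows.map(_.getString(0)).toSeq == Seq("data"))
    ```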


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-16698-regression

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14339.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14339
    
----
commit cd5d04a2661dc2e56b517478200a074ac075dec1
Author: hyukjinkwon <[email protected]>
Date:   2016-07-25T02:20:02Z

    Field names having dots should be allowed

----

