Gerard Maas created SPARK-24202:
-----------------------------------
Summary: Separate SQLContext dependencies from
SparkSession.implicits
Key: SPARK-24202
URL: https://issues.apache.org/jira/browse/SPARK-24202
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.3.0
Reporter: Gerard Maas
The current implementation of the implicits in SparkSession passes the
currently active SQLContext to the SQLImplicits class. This means that any
usage of these (extremely helpful) implicits requires the prior creation of a
SparkSession instance.
Usage is typically done as follows:
{code:java}
val sparkSession = SessionBuilder....build()
import sparkSession.implicits._
{code}
This is fine in user code, but it burdens library code that uses Spark, where
a static import of the _Encoder_ support is required.
A simple example would be:
{code:java}
abstract class SparkTransformation[In: Encoder, Out: Encoder] {
  def transform(ds: Dataset[In]): Dataset[Out]
}
{code}
Attempting to compile code that uses such a class without the implicits in
scope fails with the following error:
Unable to find encoder for type stored in a Dataset. Primitive types (Int,
String, etc) and Product types (case classes) are supported by importing
spark.implicits._ Support for serializing other types will be added in future
releases.
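To illustrate the burden: with the current design, the context-bound encoders
can only be satisfied at the call site, after a session exists and its
instance-bound implicits have been imported. A sketch (illustrative only; the
anonymous subclass is hypothetical usage of the class above):
{code:java}
import org.apache.spark.sql.{Dataset, SparkSession}

// Caller side today: a SparkSession must exist before the encoders
// required by SparkTransformation's context bounds can be resolved.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._  // supplies Encoder[String], Encoder[Int], ...

val lengths = new SparkTransformation[String, Int] {
  def transform(ds: Dataset[String]): Dataset[Int] = ds.map(_.length)
}
{code}
Library code that merely declares the transformation has no session to import
from, which is the gap this issue describes.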
The usage of the _SQLContext_ instance in _SQLImplicits_ is limited to two
utilities that turn an _RDD_ or a local collection into a _Dataset_.
These are two of the 46 implicit conversions offered by this class.
The request is to move the two implicit methods that depend on the instance
into a separate class:
{code:java}
// SQLImplicits.scala, lines 214-229

/**
 * Creates a [[Dataset]] from an RDD.
 *
 * @since 1.6.0
 */
implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] = {
  DatasetHolder(_sqlContext.createDataset(rdd))
}

/**
 * Creates a [[Dataset]] from a local Seq.
 *
 * @since 1.6.0
 */
implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] = {
  DatasetHolder(_sqlContext.createDataset(s))
}{code}
By separating the static methods from the two methods that depend on
_sqlContext_, we could provide static imports for all the other functionality
and require the instance-bound implicits only for the RDD and local-collection
support (which is an uncommon use case these days).
As this potentially breaks the current interface, it might be a candidate for
Spark 3.0, although there is nothing stopping us from creating a separate
hierarchy for the static encoders already.
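One possible shape for that split, sketched below. This is purely
illustrative: the names _StaticSQLImplicits_ and _InstanceSQLImplicits_ are
hypothetical and not part of Spark's API.
{code:java}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DatasetHolder, Encoder, Encoders, SQLContext}

// Hypothetical: holds the 44 conversions that need no SQLContext,
// so library code can import them statically.
object StaticSQLImplicits {
  implicit def newStringEncoder: Encoder[String] = Encoders.STRING
  implicit def newIntEncoder: Encoder[Int] = Encoders.scalaInt
  // ... the remaining context-free encoders ...
}

// Hypothetical: only the two instance-bound conversions stay here.
abstract class InstanceSQLImplicits(sqlContext: SQLContext) {
  implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] =
    DatasetHolder(sqlContext.createDataset(rdd))

  implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] =
    DatasetHolder(sqlContext.createDataset(s))
}
{code}
With such a hierarchy, the library example above would compile with a single
static import (e.g. {{import StaticSQLImplicits._}}) and no SparkSession.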
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)