Gerard Maas created SPARK-24202:
-----------------------------------
Summary: Separate SQLContext dependencies from
SparkSession.implicits
Key: SPARK-24202
URL: https://issues.apache.org/jira/browse/SPARK-24202
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.3.0
Reporter: Gerard Maas
The current implementation of the implicits in SparkSession passes the
currently active SQLContext to the SQLImplicits class. This means that any
usage of these (extremely helpful) implicits requires the prior creation of a
SparkSession instance.
Usage is typically done as follows:
{code:java}
val sparkSession = SessionBuilder....build()
import sparkSession.implicits._
{code}
This is fine in user code, but it burdens library code that uses Spark, where
a static import of the _Encoder_ support is required.
A simple example would be:
{code:java}
abstract class SparkTransformation[In: Encoder, Out: Encoder] {
  def transform(ds: Dataset[In]): Dataset[Out]
}
{code}
Attempting to compile code that uses such a class without the implicits in
scope fails with the following error:
Unable to find encoder for type stored in a Dataset. Primitive types (Int,
String, etc) and Product types (case classes) are supported by importing
spark.implicits._ Support for serializing other types will be added in future
releases.
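To illustrate the burden: with the current design, the context-bound encoders
can only be satisfied at the call site, after a session exists and its
instance-bound implicits have been imported. A sketch (illustrative only; the
anonymous subclass is hypothetical usage of the class above):
{code:java}
import org.apache.spark.sql.{Dataset, SparkSession}

// Caller side today: a SparkSession must exist before the encoders
// required by SparkTransformation's context bounds can be resolved.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._  // supplies Encoder[String], Encoder[Int], ...

val lengths = new SparkTransformation[String, Int] {
  def transform(ds: Dataset[String]): Dataset[Int] = ds.map(_.length)
}
{code}
Library code that merely declares the transformation has no session to import
from, which is the gap this issue describes.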
The usage of the _SQLContext_ instance in _SQLImplicits_ is limited to two
utilities that turn an _RDD_ or a local collection into a _Dataset_.
These are two of the 46 implicit conversions offered by this class.
The request is to move the two implicit methods that depend on the instance
into a separate class:
{code:java}
// SQLImplicits.scala, lines 214-229

/**
 * Creates a [[Dataset]] from an RDD.
 *
 * @since 1.6.0
 */
implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] = {
  DatasetHolder(_sqlContext.createDataset(rdd))
}

/**
 * Creates a [[Dataset]] from a local Seq.
 *
 * @since 1.6.0
 */
implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] = {
  DatasetHolder(_sqlContext.createDataset(s))
}{code}
By separating the static methods from the two methods that depend on
_sqlContext_, we could provide static imports for all the other functionality
and require the instance-bound implicits only for the RDD and local-collection
support (which is an uncommon use case these days).
As this potentially breaks the current interface, it might be a candidate for
Spark 3.0, although there is nothing stopping us from creating a separate
hierarchy for the static encoders already.
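One possible shape for that split, sketched below. This is purely
illustrative: the names _StaticSQLImplicits_ and _InstanceSQLImplicits_ are
hypothetical and not part of Spark's API.
{code:java}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DatasetHolder, Encoder, Encoders, SQLContext}

// Hypothetical: holds the 44 conversions that need no SQLContext,
// so library code can import them statically.
object StaticSQLImplicits {
  implicit def newStringEncoder: Encoder[String] = Encoders.STRING
  implicit def newIntEncoder: Encoder[Int] = Encoders.scalaInt
  // ... the remaining context-free encoders ...
}

// Hypothetical: only the two instance-bound conversions stay here.
abstract class InstanceSQLImplicits(sqlContext: SQLContext) {
  implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] =
    DatasetHolder(sqlContext.createDataset(rdd))

  implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] =
    DatasetHolder(sqlContext.createDataset(s))
}
{code}
With such a hierarchy, the library example above would compile with a single
static import (e.g. {{import StaticSQLImplicits._}}) and no SparkSession.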
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)