joshrosen-stripe commented on a change in pull request #25503:
[SPARK-28702][SQL] Display useful error message (instead of NPE) for invalid Dataset operations
URL: https://github.com/apache/spark/pull/25503#discussion_r315484406
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -184,11 +184,26 @@ private[sql] object Dataset {
*/
@Stable
class Dataset[T] private[sql](
- @transient val sparkSession: SparkSession,
+ @transient val _sparkSession: SparkSession,
@DeveloperApi @Unstable @transient val queryExecution: QueryExecution,
@DeveloperApi @Unstable @transient val encoder: Encoder[T])
extends Serializable {
+ def sparkSession: SparkSession = {
+ if (_sparkSession == null) {
+ throw new SparkException(
+ "This Dataset lacks a SparkSession. It could happen in the following cases: \n(1) Dataset " +
+ "transformations and actions are NOT invoked by the driver, but inside of other " +
+ "transformations; for example, dataset1.map(x => dataset2.values.count() * x) is invalid " +
+ "because the values transformation and count action cannot be performed inside of the " +
+ "dataset1.map transformation. For more information, see SPARK-28702.\n(2) When a Spark " +
Review comment:
We may want to either reword or remove bullet point (2): it discusses
DStreams, which I think are unlikely to be used with Datasets.
(For reference, https://github.com/apache/spark/pull/11595 added this
wording for the RDD version of this patch.)
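
For readers following the diff, the guarded-accessor pattern can be sketched without Spark. This is only an illustration under stated assumptions: `FakeSession`, `FakeDataset`, and `GuardedAccessorDemo` are hypothetical stand-ins (not Spark classes), `IllegalStateException` replaces `SparkException` to keep the sketch dependency-free, and plain Java serialization is assumed as a rough proxy for shipping a captured object to an executor.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for SparkSession; deliberately not Serializable.
class FakeSession(val name: String)

// Hypothetical stand-in for Dataset. The session field is @transient, so
// Java serialization skips it and the deserialized copy holds null.
class FakeDataset(@transient private val _session: FakeSession) extends Serializable {
  // Guarded accessor: fail with a descriptive message instead of letting
  // callers hit a bare NullPointerException later.
  def session: FakeSession = {
    if (_session == null) {
      throw new IllegalStateException(
        "This FakeDataset lacks a session; it was probably used outside the driver.")
    }
    _session
  }
}

object GuardedAccessorDemo {
  // Serialize a value to bytes and read it back with Java serialization.
  def roundTrip[A <: java.io.Serializable](value: A): A = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[A] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val original = new FakeDataset(new FakeSession("driver"))
    println(original.session.name) // the accessor works on the original object

    val copy = roundTrip(original) // the transient field is lost here
    try {
      copy.session
    } catch {
      case e: IllegalStateException => println(e.getMessage)
    }
  }
}
```

The design choice this illustrates is the one in the patch: renaming the constructor field and exposing it through a method gives a single place to detect the null and explain the likely cause.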