Josh Rosen created SPARK-28702:
----------------------------------
Summary: Display useful error message (instead of NPE) for invalid
Dataset operations (e.g. calling actions inside of transformations)
Key: SPARK-28702
URL: https://issues.apache.org/jira/browse/SPARK-28702
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.0.0
Reporter: Josh Rosen
In Spark, SparkContext and SparkSession can only be used on the driver, not on
executors. For example, this means that you cannot call
{{someDataset.collect()}} inside of a Dataset or RDD transformation.
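For illustration, a minimal (hypothetical) reproduction of the invalid pattern might look like the following; the Dataset names, app name, and {{local[2]}} master are just placeholders:
{code:scala}
import org.apache.spark.sql.SparkSession

object NestedActionRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("nested-action-repro")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    val ds1 = Seq(1, 2, 3).toDS()
    val ds2 = Seq(4, 5, 6).toDS()

    // Invalid: the map closure runs on executors, where ds2's @transient SparkSession
    // has been nulled out during serialization, so collect() fails with an NPE.
    ds1.map(x => ds2.collect().sum + x).show()

    spark.stop()
  }
}
{code}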
When Spark serializes RDDs and Datasets, references to SparkContext and
SparkSession are null'ed out (by being marked as {{@transient}} or via the
Closure Cleaner). As a result, RDD and Dataset methods which use
these driver-side-only objects (e.g. actions or transformations) will see
{{null}} references and may fail with a {{NullPointerException}}. For example,
here is the failure from code which (via a chain of calls) tried to {{collect()}} a
Dataset inside of a {{Dataset.map}} operation:
{code:java}
Caused by: java.lang.NullPointerException
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$rddQueryExecution$lzycompute(Dataset.scala:3027)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$rddQueryExecution(Dataset.scala:3025)
	at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3038)
	at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3036)
	[...]
{code}
The resulting NPE can be _very_ confusing to users.
In SPARK-5063 I added some logic to throw clearer error messages when
performing similar invalid actions on RDDs. This ticket's scope is to implement
similar logic for Datasets.
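As a rough sketch (not the actual fix), the Dataset-side check could mirror the SPARK-5063 approach: verify that the driver-side SparkSession reference is non-null before dereferencing it, and fail with a descriptive error instead of an NPE. The helper name and message wording below are hypothetical; in practice this logic would live inside {{Dataset}} itself (e.g. where {{rddQueryExecution}} touches the SparkSession):
{code:scala}
import org.apache.spark.SparkException
import org.apache.spark.sql.SparkSession

// Hypothetical helper illustrating the proposed guard.
object DriverSideCheckSketch {
  def requireDriverSideSession(session: SparkSession): SparkSession = {
    if (session == null) {
      // The transient SparkSession is only null when the Dataset has been
      // serialized into a task, i.e. we are running inside another
      // transformation on an executor.
      throw new SparkException(
        "Dataset transformations and actions can only be invoked by the driver, not " +
          "inside of other Dataset transformations; for example, " +
          "ds1.map(x => ds2.collect()) is invalid because the collect() action cannot " +
          "be performed inside of the ds1.map transformation.")
    }
    session
  }
}
{code}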