Josh Rosen created SPARK-28702:
----------------------------------
Summary: Display useful error message (instead of NPE) for invalid
Dataset operations (e.g. calling actions inside of transformations)
Key: SPARK-28702
URL: https://issues.apache.org/jira/browse/SPARK-28702
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.0.0
Reporter: Josh Rosen
In Spark, SparkContext and SparkSession can only be used on the driver, not on
executors. For example, this means that you cannot call
{{someDataset.collect()}} inside of a Dataset or RDD transformation.
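For illustration, a minimal (hypothetical) reproduction of the invalid pattern might look like the following; the Dataset names, app name, and {{local[2]}} master are just placeholders:
{code:scala}
import org.apache.spark.sql.SparkSession

object NestedActionRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("nested-action-repro")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    val ds1 = Seq(1, 2, 3).toDS()
    val ds2 = Seq(4, 5, 6).toDS()

    // Invalid: the map closure runs on executors, where ds2's @transient SparkSession
    // has been nulled out during serialization, so collect() fails with an NPE.
    ds1.map(x => ds2.collect().sum + x).show()

    spark.stop()
  }
}
{code}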
When Spark serializes RDDs and Datasets, references to SparkContext and
SparkSession are null'ed out (by being marked as {{@transient}} or via the
Closure Cleaner). As a result, RDD and Dataset methods which use
these driver-side-only objects (e.g. actions or transformations) will see
{{null}} references and may fail with a {{NullPointerException}}. For example,
here is the failure from code which (via a chain of calls) tried to {{collect()}} a
Dataset inside of a {{Dataset.map}} operation:
{code:java}
Caused by: java.lang.NullPointerException
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$rddQueryExecution$lzycompute(Dataset.scala:3027)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$rddQueryExecution(Dataset.scala:3025)
	at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3038)
	at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3036)
	[...]
{code}
The resulting NPE can be _very_ confusing to users.
In SPARK-5063 I added some logic to throw clearer error messages when
performing similar invalid actions on RDDs. This ticket's scope is to implement
similar logic for Datasets.
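As a rough sketch (not the actual fix), the Dataset-side check could mirror the SPARK-5063 approach: verify that the driver-side SparkSession reference is non-null before dereferencing it, and fail with a descriptive error instead of an NPE. The helper name and message wording below are hypothetical; in practice this logic would live inside {{Dataset}} itself (e.g. where {{rddQueryExecution}} touches the SparkSession):
{code:scala}
import org.apache.spark.SparkException
import org.apache.spark.sql.SparkSession

// Hypothetical helper illustrating the proposed guard.
object DriverSideCheckSketch {
  def requireDriverSideSession(session: SparkSession): SparkSession = {
    if (session == null) {
      // The transient SparkSession is only null when the Dataset has been
      // serialized into a task, i.e. we are running inside another
      // transformation on an executor.
      throw new SparkException(
        "Dataset transformations and actions can only be invoked by the driver, not " +
          "inside of other Dataset transformations; for example, " +
          "ds1.map(x => ds2.collect()) is invalid because the collect() action cannot " +
          "be performed inside of the ds1.map transformation.")
    }
    session
  }
}
{code}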