itholic commented on code in PR #36267:
URL: https://github.com/apache/spark/pull/36267#discussion_r854676432
##########
python/docs/source/development/debugging.rst:
##########
@@ -332,3 +332,273 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``Ar
This feature is not supported with registered UDFs.
+
+Common Exceptions / Errors
+--------------------------
+
+PySpark SQL
+~~~~~~~~~~~
+
+- **AnalysisException**
+
+AnalysisException is raised when failing to analyze a SQL query plan.
+
+Example:
+
+.. code-block:: python
+
+ >>> df = spark.range(1)
+ >>> df['bad_key']
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.AnalysisException: Cannot resolve column name "bad_key" among (id)
+
+Solution:
+
+.. code-block:: python
+
+ >>> df['id']
+ Column<'id'>
+
+- **ParseException**
+
+ParseException is raised when failing to parse a SQL command.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.sql("select * 1")
+ Traceback (most recent call last):
+ …
Review Comment:
ditto
##########
python/docs/source/development/debugging.rst:
##########
@@ -332,3 +332,273 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``Ar
This feature is not supported with registered UDFs.
+
+Common Exceptions / Errors
+--------------------------
+
+PySpark SQL
+~~~~~~~~~~~
+
+- **AnalysisException**
+
+AnalysisException is raised when failing to analyze a SQL query plan.
+
+Example:
+
+.. code-block:: python
+
+ >>> df = spark.range(1)
+ >>> df['bad_key']
+ Traceback (most recent call last):
+ …
Review Comment:
Seems like we should change `…` to `...`.
##########
python/docs/source/development/debugging.rst:
##########
@@ -332,3 +332,273 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``Ar
This feature is not supported with registered UDFs.
+
+Common Exceptions / Errors
+--------------------------
+
+PySpark SQL
+~~~~~~~~~~~
+
+- **AnalysisException**
+
+AnalysisException is raised when failing to analyze a SQL query plan.
+
+Example:
+
+.. code-block:: python
+
+ >>> df = spark.range(1)
+ >>> df['bad_key']
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.AnalysisException: Cannot resolve column name "bad_key" among (id)
+
+Solution:
+
+.. code-block:: python
+
+ >>> df['id']
+ Column<'id'>
+
+- **ParseException**
+
+ParseException is raised when failing to parse a SQL command.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.sql("select * 1")
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.ParseException:
+ Syntax error at or near '1': extra input '1'(line 1, pos 9)
+ == SQL ==
+ select * 1
+ ---------^^^
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.sql("select *")
+ DataFrame[]
+
+- **IllegalArgumentException**
+
+IllegalArgumentException is raised when passing an illegal or inappropriate argument.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(-1.0)
+ Traceback (most recent call last):
+ …
Review Comment:
ditto
##########
python/docs/source/development/debugging.rst:
##########
@@ -332,3 +332,273 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``Ar
This feature is not supported with registered UDFs.
+
+Common Exceptions / Errors
+--------------------------
+
+PySpark SQL
+~~~~~~~~~~~
+
+- **AnalysisException**
+
+AnalysisException is raised when failing to analyze a SQL query plan.
+
+Example:
+
+.. code-block:: python
+
+ >>> df = spark.range(1)
+ >>> df['bad_key']
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.AnalysisException: Cannot resolve column name "bad_key" among (id)
+
+Solution:
+
+.. code-block:: python
+
+ >>> df['id']
+ Column<'id'>
+
+- **ParseException**
+
+ParseException is raised when failing to parse a SQL command.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.sql("select * 1")
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.ParseException:
+ Syntax error at or near '1': extra input '1'(line 1, pos 9)
+ == SQL ==
+ select * 1
+ ---------^^^
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.sql("select *")
+ DataFrame[]
+
+- **IllegalArgumentException**
+
+IllegalArgumentException is raised when passing an illegal or inappropriate argument.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(-1.0)
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.IllegalArgumentException: requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(1.0)
+ DataFrame[id: bigint]
+
+- **PythonException**
+
+PythonException is thrown from Python workers.
+
+You can see the type of exception that was thrown from the Python worker and its stack trace, here “TypeError”.
+
+Example:
+
+.. code-block:: python
+
+ >>> from pyspark.sql.functions import udf
+ >>> def f(x):
+ ... return F.abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ 22/04/12 14:52:31 ERROR Executor: Exception in task 7.0 in stage 37.0 (TID 232)
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ …
+ TypeError: Invalid argument, not a string or column: -1 of type <class 'int'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
+
+Solution:
+
+.. code-block:: python
+
+ >>> def f(x):
+ ... return abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ [Row(id=-1, abs='1'), Row(id=0, abs='0')]
+
+- **StreamingQueryException**
+
+StreamingQueryException is raised when failing a StreamingQuery. Most often, it is thrown from Python workers, which wrap it as a PythonException.
+
+Example:
+
+.. code-block:: python
+
+ >>> sdf = spark.readStream.format("text").load("python/test_support/sql/streaming")
+ >>> from pyspark.sql.functions import col, udf
+ >>> bad_udf = udf(lambda x: 1 / 0)
+ >>> (sdf.select(bad_udf(col("value"))).writeStream.format("memory").queryName("q1").start()).processAllAvailable()
+ Traceback (most recent call last):
+ …
Review Comment:
ditto
##########
python/docs/source/development/debugging.rst:
##########
@@ -332,3 +332,273 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``Ar
This feature is not supported with registered UDFs.
+
+Common Exceptions / Errors
+--------------------------
+
+PySpark SQL
+~~~~~~~~~~~
+
+- **AnalysisException**
+
+AnalysisException is raised when failing to analyze a SQL query plan.
+
+Example:
+
+.. code-block:: python
+
+ >>> df = spark.range(1)
+ >>> df['bad_key']
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.AnalysisException: Cannot resolve column name "bad_key" among (id)
+
+Solution:
+
+.. code-block:: python
+
+ >>> df['id']
+ Column<'id'>
+
+- **ParseException**
+
+ParseException is raised when failing to parse a SQL command.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.sql("select * 1")
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.ParseException:
+ Syntax error at or near '1': extra input '1'(line 1, pos 9)
+ == SQL ==
+ select * 1
+ ---------^^^
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.sql("select *")
+ DataFrame[]
+
+- **IllegalArgumentException**
+
+IllegalArgumentException is raised when passing an illegal or inappropriate argument.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(-1.0)
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.IllegalArgumentException: requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(1.0)
+ DataFrame[id: bigint]
+
+- **PythonException**
+
+PythonException is thrown from Python workers.
+
+You can see the type of exception that was thrown from the Python worker and its stack trace, here “TypeError”.
+
+Example:
+
+.. code-block:: python
+
+ >>> from pyspark.sql.functions import udf
+ >>> def f(x):
+ ... return F.abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ 22/04/12 14:52:31 ERROR Executor: Exception in task 7.0 in stage 37.0 (TID 232)
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ …
+ TypeError: Invalid argument, not a string or column: -1 of type <class 'int'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
+
+Solution:
+
+.. code-block:: python
+
+ >>> def f(x):
+ ... return abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ [Row(id=-1, abs='1'), Row(id=0, abs='0')]
+
+- **StreamingQueryException**
+
+StreamingQueryException is raised when failing a StreamingQuery. Most often, it is thrown from Python workers, which wrap it as a PythonException.
+
+Example:
+
+.. code-block:: python
+
+ >>> sdf = spark.readStream.format("text").load("python/test_support/sql/streaming")
+ >>> from pyspark.sql.functions import col, udf
+ >>> bad_udf = udf(lambda x: 1 / 0)
+ >>> (sdf.select(bad_udf(col("value"))).writeStream.format("memory").queryName("q1").start()).processAllAvailable()
+ Traceback (most recent call last):
+ …
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ File "<stdin>", line 1, in <lambda>
+ ZeroDivisionError: division by zero
+ …
+ pyspark.sql.utils.StreamingQueryException: Query q1 [id = ced5797c-74e2-4079-825b-f3316b327c7d, runId = 65bacaf3-9d51-476a-80ce-0ac388d4906a] terminated with exception: Writing job aborted
+
+Solution:
+
+Fix the StreamingQuery and re-execute the workflow.
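One possible fix, as a minimal sketch: assuming the ``ZeroDivisionError`` raised inside ``bad_udf`` above is the cause, replace the UDF with one that guards against it and restart the query (``good_udf`` and the query name ``q2`` below are illustrative, not taken from the PR):

.. code-block:: python

    >>> from pyspark.sql.functions import col, udf
    >>> sdf = spark.readStream.format("text").load("python/test_support/sql/streaming")
    >>> # Guard against empty input instead of dividing by zero unconditionally.
    >>> good_udf = udf(lambda x: 1 / len(x) if x else None)
    >>> q = sdf.select(good_udf(col("value"))).writeStream.format("memory").queryName("q2").start()
    >>> q.processAllAvailable()
    >>> q.stop()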
+
+- **SparkUpgradeException**
+
+SparkUpgradeException is thrown because of Spark upgrade.
+
+Example:
+
+.. code-block:: python
+
+ >>> from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime
+ >>> df = spark.createDataFrame([("2014-31-12",)], ["date_str"])
+ >>> df2 = df.select("date_str", to_date(from_unixtime(unix_timestamp("date_str", "yyyy-dd-aa"))))
+ >>> df2.collect()
+ Traceback (most recent call last):
+ …
Review Comment:
ditto
##########
python/docs/source/development/debugging.rst:
##########
@@ -332,3 +332,273 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``Ar
This feature is not supported with registered UDFs.
+
+Common Exceptions / Errors
+--------------------------
+
+PySpark SQL
+~~~~~~~~~~~
+
+- **AnalysisException**
+
+AnalysisException is raised when failing to analyze a SQL query plan.
+
+Example:
+
+.. code-block:: python
+
+ >>> df = spark.range(1)
+ >>> df['bad_key']
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.AnalysisException: Cannot resolve column name "bad_key" among (id)
+
+Solution:
+
+.. code-block:: python
+
+ >>> df['id']
+ Column<'id'>
+
+- **ParseException**
+
+ParseException is raised when failing to parse a SQL command.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.sql("select * 1")
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.ParseException:
+ Syntax error at or near '1': extra input '1'(line 1, pos 9)
+ == SQL ==
+ select * 1
+ ---------^^^
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.sql("select *")
+ DataFrame[]
+
+- **IllegalArgumentException**
+
+IllegalArgumentException is raised when passing an illegal or inappropriate argument.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(-1.0)
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.IllegalArgumentException: requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(1.0)
+ DataFrame[id: bigint]
+
+- **PythonException**
+
+PythonException is thrown from Python workers.
+
+You can see the type of exception that was thrown from the Python worker and its stack trace, here “TypeError”.
+
+Example:
+
+.. code-block:: python
+
+ >>> from pyspark.sql.functions import udf
+ >>> def f(x):
+ ... return F.abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ 22/04/12 14:52:31 ERROR Executor: Exception in task 7.0 in stage 37.0 (TID 232)
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ …
+ TypeError: Invalid argument, not a string or column: -1 of type <class 'int'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
+
+Solution:
+
+.. code-block:: python
+
+ >>> def f(x):
+ ... return abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ [Row(id=-1, abs='1'), Row(id=0, abs='0')]
+
+- **StreamingQueryException**
+
+StreamingQueryException is raised when failing a StreamingQuery. Most often, it is thrown from Python workers, which wrap it as a PythonException.
+
+Example:
+
+.. code-block:: python
+
+ >>> sdf = spark.readStream.format("text").load("python/test_support/sql/streaming")
+ >>> from pyspark.sql.functions import col, udf
+ >>> bad_udf = udf(lambda x: 1 / 0)
+ >>> (sdf.select(bad_udf(col("value"))).writeStream.format("memory").queryName("q1").start()).processAllAvailable()
+ Traceback (most recent call last):
+ …
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ File "<stdin>", line 1, in <lambda>
+ ZeroDivisionError: division by zero
+ …
+ pyspark.sql.utils.StreamingQueryException: Query q1 [id = ced5797c-74e2-4079-825b-f3316b327c7d, runId = 65bacaf3-9d51-476a-80ce-0ac388d4906a] terminated with exception: Writing job aborted
+
+Solution:
+
+Fix the StreamingQuery and re-execute the workflow.
+
+- **SparkUpgradeException**
+
+SparkUpgradeException is thrown because of Spark upgrade.
+
+Example:
+
+.. code-block:: python
+
+ >>> from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime
+ >>> df = spark.createDataFrame([("2014-31-12",)], ["date_str"])
+ >>> df2 = df.select("date_str", to_date(from_unixtime(unix_timestamp("date_str", "yyyy-dd-aa"))))
+ >>> df2.collect()
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.SparkUpgradeException: You may get a different result due to the upgrading to Spark >= 3.0: Fail to recognize 'yyyy-dd-aa' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
+ >>> df2 = df.select("date_str", to_date(from_unixtime(unix_timestamp("date_str", "yyyy-dd-aa"))))
+ >>> df2.collect()
+ [Row(date_str='2014-31-12', to_date(from_unixtime(unix_timestamp(date_str, yyyy-dd-aa), yyyy-MM-dd HH:mm:ss))=None)]
+
+pandas API on Spark
+~~~~~~~~~~~~~~~~~~~
+
+There are specific common exceptions / errors in pandas API on Spark.
+
+- **ValueError: Cannot combine the series or dataframe because it comes from a different dataframe**
+
+Operations involving more than one series or dataframe raise a ValueError if “compute.ops_on_diff_frames” is disabled (disabled by default). Such operations may be expensive due to joining of underlying Spark frames. So users should be aware of the cost and enable that flag only when necessary.
+
+Exception:
+
+.. code-block:: python
+
+ >>> ps.Series([1, 2]) + ps.Series([3, 4])
+ Traceback (most recent call last):
+ …
+ ValueError: Cannot combine the series or dataframe because it comes from a different dataframe. In order to allow this operation, enable 'compute.ops_on_diff_frames' option.
+
+
+Solution:
+
+.. code-block:: python
+
+ >>> with ps.option_context('compute.ops_on_diff_frames', True):
+ ... ps.Series([1, 2]) + ps.Series([3, 4])
+ ...
+ 0 4
+ 1 6
+ dtype: int64
+
+- **RuntimeError: Result vector from pandas_udf was not the required length**
+
+Exception:
+
+.. code-block:: python
+
+ >>> def f(x) -> ps.Series[np.int32]:
+ ... return x[:-1]
+ ...
+ >>> ps.DataFrame({"x":[1, 2], "y":[3, 4]}).transform(f)
+ 22/04/12 13:46:39 ERROR Executor: Exception in task 2.0 in stage 16.0 (TID 88)
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ …
Review Comment:
ditto
##########
python/docs/source/development/debugging.rst:
##########
@@ -332,3 +332,273 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``Ar
This feature is not supported with registered UDFs.
+
+Common Exceptions / Errors
+--------------------------
+
+PySpark SQL
+~~~~~~~~~~~
+
+- **AnalysisException**
+
+AnalysisException is raised when failing to analyze a SQL query plan.
+
+Example:
+
+.. code-block:: python
+
+ >>> df = spark.range(1)
+ >>> df['bad_key']
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.AnalysisException: Cannot resolve column name "bad_key" among (id)
+
+Solution:
+
+.. code-block:: python
+
+ >>> df['id']
+ Column<'id'>
+
+- **ParseException**
+
+ParseException is raised when failing to parse a SQL command.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.sql("select * 1")
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.ParseException:
+ Syntax error at or near '1': extra input '1'(line 1, pos 9)
+ == SQL ==
+ select * 1
+ ---------^^^
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.sql("select *")
+ DataFrame[]
+
+- **IllegalArgumentException**
+
+IllegalArgumentException is raised when passing an illegal or inappropriate argument.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(-1.0)
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.IllegalArgumentException: requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(1.0)
+ DataFrame[id: bigint]
+
+- **PythonException**
+
+PythonException is thrown from Python workers.
+
+You can see the type of exception that was thrown from the Python worker and its stack trace, here “TypeError”.
+
+Example:
+
+.. code-block:: python
+
+ >>> from pyspark.sql.functions import udf
+ >>> def f(x):
+ ... return F.abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ 22/04/12 14:52:31 ERROR Executor: Exception in task 7.0 in stage 37.0 (TID 232)
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ …
+ TypeError: Invalid argument, not a string or column: -1 of type <class 'int'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
+
+Solution:
+
+.. code-block:: python
+
+ >>> def f(x):
+ ... return abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ [Row(id=-1, abs='1'), Row(id=0, abs='0')]
+
+- **StreamingQueryException**
+
+StreamingQueryException is raised when failing a StreamingQuery. Most often, it is thrown from Python workers, which wrap it as a PythonException.
+
+Example:
+
+.. code-block:: python
+
+ >>> sdf = spark.readStream.format("text").load("python/test_support/sql/streaming")
+ >>> from pyspark.sql.functions import col, udf
+ >>> bad_udf = udf(lambda x: 1 / 0)
+ >>> (sdf.select(bad_udf(col("value"))).writeStream.format("memory").queryName("q1").start()).processAllAvailable()
+ Traceback (most recent call last):
+ …
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ File "<stdin>", line 1, in <lambda>
+ ZeroDivisionError: division by zero
+ …
Review Comment:
ditto
##########
python/docs/source/development/debugging.rst:
##########
@@ -332,3 +332,273 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``Ar
This feature is not supported with registered UDFs.
+
+Common Exceptions / Errors
+--------------------------
+
+PySpark SQL
+~~~~~~~~~~~
+
+- **AnalysisException**
+
+AnalysisException is raised when failing to analyze a SQL query plan.
+
+Example:
+
+.. code-block:: python
+
+ >>> df = spark.range(1)
+ >>> df['bad_key']
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.AnalysisException: Cannot resolve column name "bad_key" among (id)
+
+Solution:
+
+.. code-block:: python
+
+ >>> df['id']
+ Column<'id'>
+
+- **ParseException**
+
+ParseException is raised when failing to parse a SQL command.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.sql("select * 1")
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.ParseException:
+ Syntax error at or near '1': extra input '1'(line 1, pos 9)
+ == SQL ==
+ select * 1
+ ---------^^^
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.sql("select *")
+ DataFrame[]
+
+- **IllegalArgumentException**
+
+IllegalArgumentException is raised when passing an illegal or inappropriate argument.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(-1.0)
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.IllegalArgumentException: requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(1.0)
+ DataFrame[id: bigint]
+
+- **PythonException**
+
+PythonException is thrown from Python workers.
+
+You can see the type of exception that was thrown from the Python worker and its stack trace, here “TypeError”.
+
+Example:
+
+.. code-block:: python
+
+ >>> from pyspark.sql.functions import udf
+ >>> def f(x):
+ ... return F.abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ 22/04/12 14:52:31 ERROR Executor: Exception in task 7.0 in stage 37.0 (TID 232)
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ …
Review Comment:
ditto
##########
python/docs/source/development/debugging.rst:
##########
@@ -332,3 +332,273 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``Ar
This feature is not supported with registered UDFs.
+
+Common Exceptions / Errors
+--------------------------
+
+PySpark SQL
+~~~~~~~~~~~
+
+- **AnalysisException**
+
+AnalysisException is raised when failing to analyze a SQL query plan.
+
+Example:
+
+.. code-block:: python
+
+ >>> df = spark.range(1)
+ >>> df['bad_key']
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.AnalysisException: Cannot resolve column name "bad_key" among (id)
+
+Solution:
+
+.. code-block:: python
+
+ >>> df['id']
+ Column<'id'>
+
+- **ParseException**
+
+ParseException is raised when failing to parse a SQL command.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.sql("select * 1")
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.ParseException:
+ Syntax error at or near '1': extra input '1'(line 1, pos 9)
+ == SQL ==
+ select * 1
+ ---------^^^
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.sql("select *")
+ DataFrame[]
+
+- **IllegalArgumentException**
+
+IllegalArgumentException is raised when passing an illegal or inappropriate argument.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(-1.0)
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.IllegalArgumentException: requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(1.0)
+ DataFrame[id: bigint]
+
+- **PythonException**
+
+PythonException is thrown from Python workers.
+
+You can see the type of exception that was thrown from the Python worker and its stack trace, here “TypeError”.
+
+Example:
+
+.. code-block:: python
+
+ >>> from pyspark.sql.functions import udf
+ >>> def f(x):
+ ... return F.abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ 22/04/12 14:52:31 ERROR Executor: Exception in task 7.0 in stage 37.0 (TID 232)
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ …
+ TypeError: Invalid argument, not a string or column: -1 of type <class 'int'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
+
+Solution:
+
+.. code-block:: python
+
+ >>> def f(x):
+ ... return abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ [Row(id=-1, abs='1'), Row(id=0, abs='0')]
+
+- **StreamingQueryException**
+
+StreamingQueryException is raised when failing a StreamingQuery. Most often, it is thrown from Python workers, which wrap it as a PythonException.
+
+Example:
+
+.. code-block:: python
+
+ >>> sdf = spark.readStream.format("text").load("python/test_support/sql/streaming")
+ >>> from pyspark.sql.functions import col, udf
+ >>> bad_udf = udf(lambda x: 1 / 0)
+ >>> (sdf.select(bad_udf(col("value"))).writeStream.format("memory").queryName("q1").start()).processAllAvailable()
+ Traceback (most recent call last):
+ …
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ File "<stdin>", line 1, in <lambda>
+ ZeroDivisionError: division by zero
+ …
+ pyspark.sql.utils.StreamingQueryException: Query q1 [id = ced5797c-74e2-4079-825b-f3316b327c7d, runId = 65bacaf3-9d51-476a-80ce-0ac388d4906a] terminated with exception: Writing job aborted
+
+Solution:
+
+Fix the StreamingQuery and re-execute the workflow.
+
+- **SparkUpgradeException**
+
+SparkUpgradeException is thrown because of Spark upgrade.
+
+Example:
+
+.. code-block:: python
+
+ >>> from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime
+ >>> df = spark.createDataFrame([("2014-31-12",)], ["date_str"])
+ >>> df2 = df.select("date_str", to_date(from_unixtime(unix_timestamp("date_str", "yyyy-dd-aa"))))
+ >>> df2.collect()
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.SparkUpgradeException: You may get a different result due to the upgrading to Spark >= 3.0: Fail to recognize 'yyyy-dd-aa' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
+ >>> df2 = df.select("date_str", to_date(from_unixtime(unix_timestamp("date_str", "yyyy-dd-aa"))))
+ >>> df2.collect()
+ [Row(date_str='2014-31-12', to_date(from_unixtime(unix_timestamp(date_str, yyyy-dd-aa), yyyy-MM-dd HH:mm:ss))=None)]
+
+pandas API on Spark
+~~~~~~~~~~~~~~~~~~~
+
+There are specific common exceptions / errors in pandas API on Spark.
+
+- **ValueError: Cannot combine the series or dataframe because it comes from a different dataframe**
+
+Operations involving more than one series or dataframe raise a ValueError if “compute.ops_on_diff_frames” is disabled (disabled by default). Such operations may be expensive due to joining of underlying Spark frames. So users should be aware of the cost and enable that flag only when necessary.
+
+Exception:
+
+.. code-block:: python
+
+ >>> ps.Series([1, 2]) + ps.Series([3, 4])
+ Traceback (most recent call last):
+ …
Review Comment:
ditto
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]