itholic commented on code in PR #36267:
URL: https://github.com/apache/spark/pull/36267#discussion_r854676432
##########
python/docs/source/development/debugging.rst:
##########
@@ -332,3 +332,273 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``Ar
This feature is not supported with registered UDFs.
+
+Common Exceptions / Errors
+--------------------------
+
+PySpark SQL
+~~~~~~~~~~~
+
+- **AnalysisException**
+
+AnalysisException is raised when failing to analyze a SQL query plan.
+
+Example:
+
+.. code-block:: python
+
+ >>> df = spark.range(1)
+ >>> df['bad_key']
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.AnalysisException: Cannot resolve column name "bad_key" among (id)
+
+Solution:
+
+.. code-block:: python
+
+ >>> df['id']
+ Column<'id'>
+
+- **ParseException**
+
+ParseException is raised when failing to parse a SQL command.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.sql("select * 1")
+ Traceback (most recent call last):
+ …
Review Comment:
ditto
##########
python/docs/source/development/debugging.rst:
##########
@@ -332,3 +332,273 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``Ar
This feature is not supported with registered UDFs.
+
+Common Exceptions / Errors
+--------------------------
+
+PySpark SQL
+~~~~~~~~~~~
+
+- **AnalysisException**
+
+AnalysisException is raised when failing to analyze a SQL query plan.
+
+Example:
+
+.. code-block:: python
+
+ >>> df = spark.range(1)
+ >>> df['bad_key']
+ Traceback (most recent call last):
+ …
Review Comment:
Seems like we should change `…` to `...`.
##########
python/docs/source/development/debugging.rst:
##########
@@ -332,3 +332,273 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``Ar
This feature is not supported with registered UDFs.
+
+Common Exceptions / Errors
+--------------------------
+
+PySpark SQL
+~~~~~~~~~~~
+
+- **AnalysisException**
+
+AnalysisException is raised when failing to analyze a SQL query plan.
+
+Example:
+
+.. code-block:: python
+
+ >>> df = spark.range(1)
+ >>> df['bad_key']
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.AnalysisException: Cannot resolve column name "bad_key" among (id)
+
+Solution:
+
+.. code-block:: python
+
+ >>> df['id']
+ Column<'id'>
+
+- **ParseException**
+
+ParseException is raised when failing to parse a SQL command.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.sql("select * 1")
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.ParseException:
+ Syntax error at or near '1': extra input '1'(line 1, pos 9)
+ == SQL ==
+ select * 1
+ ---------^^^
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.sql("select *")
+ DataFrame[]
+
+- **IllegalArgumentException**
+
+IllegalArgumentException is raised when passing an illegal or inappropriate argument.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(-1.0)
+ Traceback (most recent call last):
+ …
Review Comment:
ditto
##########
python/docs/source/development/debugging.rst:
##########
@@ -332,3 +332,273 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``Ar
This feature is not supported with registered UDFs.
+
+Common Exceptions / Errors
+--------------------------
+
+PySpark SQL
+~~~~~~~~~~~
+
+- **AnalysisException**
+
+AnalysisException is raised when failing to analyze a SQL query plan.
+
+Example:
+
+.. code-block:: python
+
+ >>> df = spark.range(1)
+ >>> df['bad_key']
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.AnalysisException: Cannot resolve column name "bad_key" among (id)
+
+Solution:
+
+.. code-block:: python
+
+ >>> df['id']
+ Column<'id'>
+
+- **ParseException**
+
+ParseException is raised when failing to parse a SQL command.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.sql("select * 1")
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.ParseException:
+ Syntax error at or near '1': extra input '1'(line 1, pos 9)
+ == SQL ==
+ select * 1
+ ---------^^^
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.sql("select *")
+ DataFrame[]
+
+- **IllegalArgumentException**
+
+IllegalArgumentException is raised when passing an illegal or inappropriate argument.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(-1.0)
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.IllegalArgumentException: requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(1.0)
+ DataFrame[id: bigint]
+
+- **PythonException**
+
+PythonException is thrown from Python workers.
+
+You can see the type of exception that was thrown from the Python worker and its stack trace, here “TypeError”.
+
+Example:
+
+.. code-block:: python
+
+ >>> from pyspark.sql.functions import udf
+ >>> def f(x):
+ ... return F.abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ 22/04/12 14:52:31 ERROR Executor: Exception in task 7.0 in stage 37.0 (TID 232)
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ …
+ TypeError: Invalid argument, not a string or column: -1 of type <class 'int'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
+
+Solution:
+
+.. code-block:: python
+
+ >>> def f(x):
+ ... return abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ [Row(id=-1, abs='1'), Row(id=0, abs='0')]
+
+- **StreamingQueryException**
+
+StreamingQueryException is raised when failing a StreamingQuery. Most often, it is thrown from Python workers, which wrap it as a PythonException.
+
+Example:
+
+.. code-block:: python
+
+ >>> sdf = spark.readStream.format("text").load("python/test_support/sql/streaming")
+ >>> from pyspark.sql.functions import col, udf
+ >>> bad_udf = udf(lambda x: 1 / 0)
+ >>> (sdf.select(bad_udf(col("value"))).writeStream.format("memory").queryName("q1").start()).processAllAvailable()
+ Traceback (most recent call last):
+ …
Review Comment:
ditto
##########
python/docs/source/development/debugging.rst:
##########
@@ -332,3 +332,273 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``Ar
This feature is not supported with registered UDFs.
+
+Common Exceptions / Errors
+--------------------------
+
+PySpark SQL
+~~~~~~~~~~~
+
+- **AnalysisException**
+
+AnalysisException is raised when failing to analyze a SQL query plan.
+
+Example:
+
+.. code-block:: python
+
+ >>> df = spark.range(1)
+ >>> df['bad_key']
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.AnalysisException: Cannot resolve column name "bad_key" among (id)
+
+Solution:
+
+.. code-block:: python
+
+ >>> df['id']
+ Column<'id'>
+
+- **ParseException**
+
+ParseException is raised when failing to parse a SQL command.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.sql("select * 1")
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.ParseException:
+ Syntax error at or near '1': extra input '1'(line 1, pos 9)
+ == SQL ==
+ select * 1
+ ---------^^^
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.sql("select *")
+ DataFrame[]
+
+- **IllegalArgumentException**
+
+IllegalArgumentException is raised when passing an illegal or inappropriate argument.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(-1.0)
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.IllegalArgumentException: requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(1.0)
+ DataFrame[id: bigint]
+
+- **PythonException**
+
+PythonException is thrown from Python workers.
+
+You can see the type of exception that was thrown from the Python worker and its stack trace, here “TypeError”.
+
+Example:
+
+.. code-block:: python
+
+ >>> from pyspark.sql.functions import udf
+ >>> def f(x):
+ ... return F.abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ 22/04/12 14:52:31 ERROR Executor: Exception in task 7.0 in stage 37.0 (TID 232)
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ …
+ TypeError: Invalid argument, not a string or column: -1 of type <class 'int'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
+
+Solution:
+
+.. code-block:: python
+
+ >>> def f(x):
+ ... return abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ [Row(id=-1, abs='1'), Row(id=0, abs='0')]
+
+- **StreamingQueryException**
+
+StreamingQueryException is raised when failing a StreamingQuery. Most often, it is thrown from Python workers, which wrap it as a PythonException.
+
+Example:
+
+.. code-block:: python
+
+ >>> sdf = spark.readStream.format("text").load("python/test_support/sql/streaming")
+ >>> from pyspark.sql.functions import col, udf
+ >>> bad_udf = udf(lambda x: 1 / 0)
+ >>> (sdf.select(bad_udf(col("value"))).writeStream.format("memory").queryName("q1").start()).processAllAvailable()
+ Traceback (most recent call last):
+ …
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ File "<stdin>", line 1, in <lambda>
+ ZeroDivisionError: division by zero
+ …
+ pyspark.sql.utils.StreamingQueryException: Query q1 [id = ced5797c-74e2-4079-825b-f3316b327c7d, runId = 65bacaf3-9d51-476a-80ce-0ac388d4906a] terminated with exception: Writing job aborted
+
+Solution:
+
+Fix the StreamingQuery and re-execute the workflow.
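One possible fix, as a minimal sketch: assuming the ``ZeroDivisionError`` raised inside ``bad_udf`` above is the cause, replace the UDF with one that guards against it and restart the query (``good_udf`` and the query name ``q2`` below are illustrative, not taken from the PR):

.. code-block:: python

    >>> from pyspark.sql.functions import col, udf
    >>> sdf = spark.readStream.format("text").load("python/test_support/sql/streaming")
    >>> # Guard against empty input instead of dividing by zero unconditionally.
    >>> good_udf = udf(lambda x: 1 / len(x) if x else None)
    >>> q = sdf.select(good_udf(col("value"))).writeStream.format("memory").queryName("q2").start()
    >>> q.processAllAvailable()
    >>> q.stop()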
+
+- **SparkUpgradeException**
+
+SparkUpgradeException is thrown because of Spark upgrade.
+
+Example:
+
+.. code-block:: python
+
+ >>> from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime
+ >>> df = spark.createDataFrame([("2014-31-12",)], ["date_str"])
+ >>> df2 = df.select("date_str", to_date(from_unixtime(unix_timestamp("date_str", "yyyy-dd-aa"))))
+ >>> df2.collect()
+ Traceback (most recent call last):
+ …
Review Comment:
ditto
##########
python/docs/source/development/debugging.rst:
##########
@@ -332,3 +332,273 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``Ar
This feature is not supported with registered UDFs.
+
+Common Exceptions / Errors
+--------------------------
+
+PySpark SQL
+~~~~~~~~~~~
+
+- **AnalysisException**
+
+AnalysisException is raised when failing to analyze a SQL query plan.
+
+Example:
+
+.. code-block:: python
+
+ >>> df = spark.range(1)
+ >>> df['bad_key']
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.AnalysisException: Cannot resolve column name "bad_key" among (id)
+
+Solution:
+
+.. code-block:: python
+
+ >>> df['id']
+ Column<'id'>
+
+- **ParseException**
+
+ParseException is raised when failing to parse a SQL command.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.sql("select * 1")
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.ParseException:
+ Syntax error at or near '1': extra input '1'(line 1, pos 9)
+ == SQL ==
+ select * 1
+ ---------^^^
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.sql("select *")
+ DataFrame[]
+
+- **IllegalArgumentException**
+
+IllegalArgumentException is raised when passing an illegal or inappropriate argument.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(-1.0)
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.IllegalArgumentException: requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(1.0)
+ DataFrame[id: bigint]
+
+- **PythonException**
+
+PythonException is thrown from Python workers.
+
+You can see the type of exception that was thrown from the Python worker and its stack trace, here “TypeError”.
+
+Example:
+
+.. code-block:: python
+
+ >>> from pyspark.sql.functions import udf
+ >>> def f(x):
+ ... return F.abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ 22/04/12 14:52:31 ERROR Executor: Exception in task 7.0 in stage 37.0 (TID 232)
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ …
+ TypeError: Invalid argument, not a string or column: -1 of type <class 'int'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
+
+Solution:
+
+.. code-block:: python
+
+ >>> def f(x):
+ ... return abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ [Row(id=-1, abs='1'), Row(id=0, abs='0')]
+
+- **StreamingQueryException**
+
+StreamingQueryException is raised when failing a StreamingQuery. Most often, it is thrown from Python workers, which wrap it as a PythonException.
+
+Example:
+
+.. code-block:: python
+
+ >>> sdf = spark.readStream.format("text").load("python/test_support/sql/streaming")
+ >>> from pyspark.sql.functions import col, udf
+ >>> bad_udf = udf(lambda x: 1 / 0)
+ >>> (sdf.select(bad_udf(col("value"))).writeStream.format("memory").queryName("q1").start()).processAllAvailable()
+ Traceback (most recent call last):
+ …
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ File "<stdin>", line 1, in <lambda>
+ ZeroDivisionError: division by zero
+ …
+ pyspark.sql.utils.StreamingQueryException: Query q1 [id = ced5797c-74e2-4079-825b-f3316b327c7d, runId = 65bacaf3-9d51-476a-80ce-0ac388d4906a] terminated with exception: Writing job aborted
+
+Solution:
+
+Fix the StreamingQuery and re-execute the workflow.
+
+- **SparkUpgradeException**
+
+SparkUpgradeException is thrown because of Spark upgrade.
+
+Example:
+
+.. code-block:: python
+
+ >>> from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime
+ >>> df = spark.createDataFrame([("2014-31-12",)], ["date_str"])
+ >>> df2 = df.select("date_str", to_date(from_unixtime(unix_timestamp("date_str", "yyyy-dd-aa"))))
+ >>> df2.collect()
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.SparkUpgradeException: You may get a different result due to the upgrading to Spark >= 3.0: Fail to recognize 'yyyy-dd-aa' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
+ >>> df2 = df.select("date_str", to_date(from_unixtime(unix_timestamp("date_str", "yyyy-dd-aa"))))
+ >>> df2.collect()
+ [Row(date_str='2014-31-12', to_date(from_unixtime(unix_timestamp(date_str, yyyy-dd-aa), yyyy-MM-dd HH:mm:ss))=None)]
+
+pandas API on Spark
+~~~~~~~~~~~~~~~~~~~
+
+There are specific common exceptions / errors in pandas API on Spark.
+
+- **ValueError: Cannot combine the series or dataframe because it comes from a different dataframe**
+
+Operations involving more than one series or dataframe raise a ValueError if “compute.ops_on_diff_frames” is disabled (disabled by default). Such operations may be expensive due to joining of underlying Spark frames. So users should be aware of the cost and enable that flag only when necessary.
+
+Exception:
+
+.. code-block:: python
+
+ >>> ps.Series([1, 2]) + ps.Series([3, 4])
+ Traceback (most recent call last):
+ …
+ ValueError: Cannot combine the series or dataframe because it comes from a different dataframe. In order to allow this operation, enable 'compute.ops_on_diff_frames' option.
+
+
+Solution:
+
+.. code-block:: python
+
+ >>> with ps.option_context('compute.ops_on_diff_frames', True):
+ ... ps.Series([1, 2]) + ps.Series([3, 4])
+ ...
+ 0 4
+ 1 6
+ dtype: int64
+
+- **RuntimeError: Result vector from pandas_udf was not the required length**
+
+Exception:
+
+.. code-block:: python
+
+ >>> def f(x) -> ps.Series[np.int32]:
+ ... return x[:-1]
+ ...
+ >>> ps.DataFrame({"x":[1, 2], "y":[3, 4]}).transform(f)
+ 22/04/12 13:46:39 ERROR Executor: Exception in task 2.0 in stage 16.0 (TID 88)
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ …
Review Comment:
ditto
##########
python/docs/source/development/debugging.rst:
##########
@@ -332,3 +332,273 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``Ar
This feature is not supported with registered UDFs.
+
+Common Exceptions / Errors
+--------------------------
+
+PySpark SQL
+~~~~~~~~~~~
+
+- **AnalysisException**
+
+AnalysisException is raised when failing to analyze a SQL query plan.
+
+Example:
+
+.. code-block:: python
+
+ >>> df = spark.range(1)
+ >>> df['bad_key']
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.AnalysisException: Cannot resolve column name "bad_key" among (id)
+
+Solution:
+
+.. code-block:: python
+
+ >>> df['id']
+ Column<'id'>
+
+- **ParseException**
+
+ParseException is raised when failing to parse a SQL command.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.sql("select * 1")
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.ParseException:
+ Syntax error at or near '1': extra input '1'(line 1, pos 9)
+ == SQL ==
+ select * 1
+ ---------^^^
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.sql("select *")
+ DataFrame[]
+
+- **IllegalArgumentException**
+
+IllegalArgumentException is raised when passing an illegal or inappropriate argument.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(-1.0)
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.IllegalArgumentException: requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(1.0)
+ DataFrame[id: bigint]
+
+- **PythonException**
+
+PythonException is thrown from Python workers.
+
+You can see the type of exception that was thrown from the Python worker and its stack trace, here “TypeError”.
+
+Example:
+
+.. code-block:: python
+
+ >>> from pyspark.sql.functions import udf
+ >>> def f(x):
+ ... return F.abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ 22/04/12 14:52:31 ERROR Executor: Exception in task 7.0 in stage 37.0 (TID 232)
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ …
+ TypeError: Invalid argument, not a string or column: -1 of type <class 'int'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
+
+Solution:
+
+.. code-block:: python
+
+ >>> def f(x):
+ ... return abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ [Row(id=-1, abs='1'), Row(id=0, abs='0')]
+
+- **StreamingQueryException**
+
+StreamingQueryException is raised when failing a StreamingQuery. Most often, it is thrown from Python workers, which wrap it as a PythonException.
+
+Example:
+
+.. code-block:: python
+
+ >>> sdf = spark.readStream.format("text").load("python/test_support/sql/streaming")
+ >>> from pyspark.sql.functions import col, udf
+ >>> bad_udf = udf(lambda x: 1 / 0)
+ >>> (sdf.select(bad_udf(col("value"))).writeStream.format("memory").queryName("q1").start()).processAllAvailable()
+ Traceback (most recent call last):
+ …
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ File "<stdin>", line 1, in <lambda>
+ ZeroDivisionError: division by zero
+ …
Review Comment:
ditto
##########
python/docs/source/development/debugging.rst:
##########
@@ -332,3 +332,273 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``Ar
This feature is not supported with registered UDFs.
+
+Common Exceptions / Errors
+--------------------------
+
+PySpark SQL
+~~~~~~~~~~~
+
+- **AnalysisException**
+
+AnalysisException is raised when failing to analyze a SQL query plan.
+
+Example:
+
+.. code-block:: python
+
+ >>> df = spark.range(1)
+ >>> df['bad_key']
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.AnalysisException: Cannot resolve column name "bad_key" among (id)
+
+Solution:
+
+.. code-block:: python
+
+ >>> df['id']
+ Column<'id'>
+
+- **ParseException**
+
+ParseException is raised when failing to parse a SQL command.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.sql("select * 1")
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.ParseException:
+ Syntax error at or near '1': extra input '1'(line 1, pos 9)
+ == SQL ==
+ select * 1
+ ---------^^^
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.sql("select *")
+ DataFrame[]
+
+- **IllegalArgumentException**
+
+IllegalArgumentException is raised when passing an illegal or inappropriate argument.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(-1.0)
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.IllegalArgumentException: requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(1.0)
+ DataFrame[id: bigint]
+
+- **PythonException**
+
+PythonException is thrown from Python workers.
+
+You can see the type of exception that was thrown from the Python worker and its stack trace, here “TypeError”.
+
+Example:
+
+.. code-block:: python
+
+ >>> from pyspark.sql.functions import udf
+ >>> def f(x):
+ ... return F.abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ 22/04/12 14:52:31 ERROR Executor: Exception in task 7.0 in stage 37.0 (TID 232)
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ …
Review Comment:
ditto
##########
python/docs/source/development/debugging.rst:
##########
@@ -332,3 +332,273 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``Ar
This feature is not supported with registered UDFs.
+
+Common Exceptions / Errors
+--------------------------
+
+PySpark SQL
+~~~~~~~~~~~
+
+- **AnalysisException**
+
+AnalysisException is raised when failing to analyze a SQL query plan.
+
+Example:
+
+.. code-block:: python
+
+ >>> df = spark.range(1)
+ >>> df['bad_key']
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.AnalysisException: Cannot resolve column name "bad_key" among (id)
+
+Solution:
+
+.. code-block:: python
+
+ >>> df['id']
+ Column<'id'>
+
+- **ParseException**
+
+ParseException is raised when failing to parse a SQL command.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.sql("select * 1")
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.ParseException:
+ Syntax error at or near '1': extra input '1'(line 1, pos 9)
+ == SQL ==
+ select * 1
+ ---------^^^
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.sql("select *")
+ DataFrame[]
+
+- **IllegalArgumentException**
+
+IllegalArgumentException is raised when passing an illegal or inappropriate argument.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(-1.0)
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.IllegalArgumentException: requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(1.0)
+ DataFrame[id: bigint]
+
+- **PythonException**
+
+PythonException is thrown from Python workers.
+
+You can see the type of exception that was thrown from the Python worker and its stack trace, here “TypeError”.
+
+Example:
+
+.. code-block:: python
+
+ >>> from pyspark.sql.functions import udf
+ >>> def f(x):
+ ... return F.abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ 22/04/12 14:52:31 ERROR Executor: Exception in task 7.0 in stage 37.0 (TID 232)
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ …
+ TypeError: Invalid argument, not a string or column: -1 of type <class 'int'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
+
+Solution:
+
+.. code-block:: python
+
+ >>> def f(x):
+ ... return abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ [Row(id=-1, abs='1'), Row(id=0, abs='0')]
+
+- **StreamingQueryException**
+
+StreamingQueryException is raised when failing a StreamingQuery. Most often, it is thrown from Python workers, which wrap it as a PythonException.
+
+Example:
+
+.. code-block:: python
+
+ >>> sdf = spark.readStream.format("text").load("python/test_support/sql/streaming")
+ >>> from pyspark.sql.functions import col, udf
+ >>> bad_udf = udf(lambda x: 1 / 0)
+ >>> (sdf.select(bad_udf(col("value"))).writeStream.format("memory").queryName("q1").start()).processAllAvailable()
+ Traceback (most recent call last):
+ …
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ File "<stdin>", line 1, in <lambda>
+ ZeroDivisionError: division by zero
+ …
+ pyspark.sql.utils.StreamingQueryException: Query q1 [id = ced5797c-74e2-4079-825b-f3316b327c7d, runId = 65bacaf3-9d51-476a-80ce-0ac388d4906a] terminated with exception: Writing job aborted
+
+Solution:
+
+Fix the StreamingQuery and re-execute the workflow.
+
+- **SparkUpgradeException**
+
+SparkUpgradeException is thrown because of Spark upgrade.
+
+Example:
+
+.. code-block:: python
+
+ >>> from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime
+ >>> df = spark.createDataFrame([("2014-31-12",)], ["date_str"])
+ >>> df2 = df.select("date_str", to_date(from_unixtime(unix_timestamp("date_str", "yyyy-dd-aa"))))
+ >>> df2.collect()
+ Traceback (most recent call last):
+ …
+ pyspark.sql.utils.SparkUpgradeException: You may get a different result due to the upgrading to Spark >= 3.0: Fail to recognize 'yyyy-dd-aa' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
+ >>> df2 = df.select("date_str", to_date(from_unixtime(unix_timestamp("date_str", "yyyy-dd-aa"))))
+ >>> df2.collect()
+ [Row(date_str='2014-31-12', to_date(from_unixtime(unix_timestamp(date_str, yyyy-dd-aa), yyyy-MM-dd HH:mm:ss))=None)]
+
+pandas API on Spark
+~~~~~~~~~~~~~~~~~~~
+
+There are specific common exceptions / errors in pandas API on Spark.
+
+- **ValueError: Cannot combine the series or dataframe because it comes from a different dataframe**
+
+Operations involving more than one series or dataframe raise a ValueError if “compute.ops_on_diff_frames” is disabled (disabled by default). Such operations may be expensive due to joining of underlying Spark frames. So users should be aware of the cost and enable that flag only when necessary.
+
+Exception:
+
+.. code-block:: python
+
+ >>> ps.Series([1, 2]) + ps.Series([3, 4])
+ Traceback (most recent call last):
+ …
Review Comment:
ditto
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]