This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.3 by this push:
new 2672624931d [SPARK-38953][PYTHON][DOC] Document PySpark common exceptions / errors
2672624931d is described below
commit 2672624931dd4784fad6cdd912e3669c83741060
Author: Xinrong Meng <[email protected]>
AuthorDate: Sun May 15 09:25:02 2022 +0900
[SPARK-38953][PYTHON][DOC] Document PySpark common exceptions / errors
### What changes were proposed in this pull request?
Document common exceptions/errors in PySpark (SQL, pandas API on Spark, and Py4J) and their respective solutions.
### Why are the changes needed?
Make PySpark debugging easier.
There are common exceptions/errors in PySpark SQL, pandas API on Spark, and Py4J.
Documenting these exceptions and their respective solutions helps users debug PySpark.
### Does this PR introduce _any_ user-facing change?
No. Document change only.
### How was this patch tested?
Manual test.
<img width="1019" alt="image" src="https://user-images.githubusercontent.com/47337188/165145874-b0de33b1-835a-459d-9062-94086e62e254.png">
Please refer to
https://github.com/apache/spark/blob/7a1c7599a21cbbe2778707b72643cf98ac601ab1/python/docs/source/development/debugging.rst#common-exceptions--errors
for the whole rendered page.
Closes #36267 from xinrong-databricks/common_err.
Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit f940d7adfd6d071bc3bdcc406e01263a7f03e955)
Signed-off-by: Hyukjin Kwon <[email protected]>
---
python/docs/source/development/debugging.rst | 280 +++++++++++++++++++++++++++
1 file changed, 280 insertions(+)
diff --git a/python/docs/source/development/debugging.rst b/python/docs/source/development/debugging.rst
index 1e6571da028..05c47ae4bf7 100644
--- a/python/docs/source/development/debugging.rst
+++ b/python/docs/source/development/debugging.rst
@@ -332,3 +332,283 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``Ar
This feature is not supported with registered UDFs.
+
+Common Exceptions / Errors
+--------------------------
+
+PySpark SQL
+~~~~~~~~~~~
+
+**AnalysisException**
+
+``AnalysisException`` is raised when failing to analyze a SQL query plan.
+
+Example:
+
+.. code-block:: python
+
+ >>> df = spark.range(1)
+ >>> df['bad_key']
+ Traceback (most recent call last):
+ ...
+ pyspark.sql.utils.AnalysisException: Cannot resolve column name "bad_key" among (id)
+
+Solution:
+
+.. code-block:: python
+
+ >>> df['id']
+ Column<'id'>
+
+**ParseException**
+
+``ParseException`` is raised when failing to parse a SQL command.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.sql("select * 1")
+ Traceback (most recent call last):
+ ...
+ pyspark.sql.utils.ParseException:
+ Syntax error at or near '1': extra input '1'(line 1, pos 9)
+ == SQL ==
+ select * 1
+ ---------^^^
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.sql("select *")
+ DataFrame[]
+
+**IllegalArgumentException**
+
+``IllegalArgumentException`` is raised when passing an illegal or inappropriate argument.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(-1.0)
+ Traceback (most recent call last):
+ ...
+ pyspark.sql.utils.IllegalArgumentException: requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(1.0)
+ DataFrame[id: bigint]
+
+**PythonException**
+
+``PythonException`` is thrown from Python workers.
+
+You can see the type of exception that was thrown from the Python worker and its stack trace, such as the ``TypeError`` below.
+
+Example:
+
+.. code-block:: python
+
+ >>> import pyspark.sql.functions as F
+ >>> from pyspark.sql.functions import udf
+ >>> def f(x):
+ ... return F.abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ 22/04/12 14:52:31 ERROR Executor: Exception in task 7.0 in stage 37.0 (TID 232)
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ ...
+ TypeError: Invalid argument, not a string or column: -1 of type <class 'int'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
+
+Solution:
+
+.. code-block:: python
+
+ >>> def f(x):
+ ... return abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ [Row(id=-1, abs='1'), Row(id=0, abs='0')]
+
+**StreamingQueryException**
+
+``StreamingQueryException`` is raised when a StreamingQuery fails. Most often, the underlying error is raised in a Python worker and wrapped as a ``PythonException``.
+
+Example:
+
+.. code-block:: python
+
+ >>> sdf = spark.readStream.format("text").load("python/test_support/sql/streaming")
+ >>> from pyspark.sql.functions import col, udf
+ >>> bad_udf = udf(lambda x: 1 / 0)
+ >>> (sdf.select(bad_udf(col("value"))).writeStream.format("memory").queryName("q1").start()).processAllAvailable()
+ Traceback (most recent call last):
+ ...
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ File "<stdin>", line 1, in <lambda>
+ ZeroDivisionError: division by zero
+ ...
+ pyspark.sql.utils.StreamingQueryException: Query q1 [id = ced5797c-74e2-4079-825b-f3316b327c7d, runId = 65bacaf3-9d51-476a-80ce-0ac388d4906a] terminated with exception: Writing job aborted
+
+Solution:
+
+Fix the StreamingQuery and re-execute the workflow.
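+
+A minimal sketch of that fix, reusing ``sdf`` from the example above (the guarded UDF and the query name ``q2`` are illustrative, not part of PySpark):
+
+.. code-block:: python
+
+ >>> from pyspark.sql.functions import col, udf
+ >>> good_udf = udf(lambda x: 1 / len(x) if x else 0.0)  # guard against empty input
+ >>> q = sdf.select(good_udf(col("value"))).writeStream.format("memory").queryName("q2").start()
+ >>> q.processAllAvailable()
+ >>> q.stop()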
+
+**SparkUpgradeException**
+
+``SparkUpgradeException`` is thrown because of a Spark upgrade: usage that worked on an earlier Spark version is no longer accepted after the upgrade.
+
+Example:
+
+.. code-block:: python
+
+ >>> from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime
+ >>> df = spark.createDataFrame([("2014-31-12",)], ["date_str"])
+ >>> df2 = df.select("date_str", to_date(from_unixtime(unix_timestamp("date_str", "yyyy-dd-aa"))))
+ >>> df2.collect()
+ Traceback (most recent call last):
+ ...
+ pyspark.sql.utils.SparkUpgradeException: You may get a different result due to the upgrading to Spark >= 3.0: Fail to recognize 'yyyy-dd-aa' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
+ >>> df2 = df.select("date_str", to_date(from_unixtime(unix_timestamp("date_str", "yyyy-dd-aa"))))
+ >>> df2.collect()
+ [Row(date_str='2014-31-12', to_date(from_unixtime(unix_timestamp(date_str, yyyy-dd-aa), yyyy-MM-dd HH:mm:ss))=None)]
+
+pandas API on Spark
+~~~~~~~~~~~~~~~~~~~
+
+The following exceptions / errors are specific to pandas API on Spark.
+
+**ValueError: Cannot combine the series or dataframe because it comes from a different dataframe**
+
+Operations involving more than one Series or DataFrame raise a ``ValueError`` if ``compute.ops_on_diff_frames`` is disabled (it is disabled by default). Such operations may be expensive because they join the underlying Spark DataFrames, so be aware of the cost and enable the flag only when necessary.
+
+Exception:
+
+.. code-block:: python
+
+ >>> import pyspark.pandas as ps
+ >>> ps.Series([1, 2]) + ps.Series([3, 4])
+ Traceback (most recent call last):
+ ...
+ ValueError: Cannot combine the series or dataframe because it comes from a different dataframe. In order to allow this operation, enable 'compute.ops_on_diff_frames' option.
+
+
+Solution:
+
+.. code-block:: python
+
+ >>> with ps.option_context('compute.ops_on_diff_frames', True):
+ ... ps.Series([1, 2]) + ps.Series([3, 4])
+ ...
+ 0 4
+ 1 6
+ dtype: int64
+
+**RuntimeError: Result vector from pandas_udf was not the required length**
+
+Exception:
+
+.. code-block:: python
+
+ >>> import numpy as np
+ >>> import pyspark.pandas as ps
+ >>> def f(x) -> ps.Series[np.int32]:
+ ... return x[:-1]
+ ...
+ >>> ps.DataFrame({"x":[1, 2], "y":[3, 4]}).transform(f)
+ 22/04/12 13:46:39 ERROR Executor: Exception in task 2.0 in stage 16.0 (TID 88)
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ ...
+ RuntimeError: Result vector from pandas_udf was not the required length: expected 1, got 0
+
+Solution:
+
+.. code-block:: python
+
+ >>> def f(x) -> ps.Series[np.int32]:
+ ... return x
+ ...
+ >>> ps.DataFrame({"x":[1, 2], "y":[3, 4]}).transform(f)
+ x y
+ 0 1 3
+ 1 2 4
+
+Py4J
+~~~~
+
+**Py4JJavaError**
+
+``Py4JJavaError`` is raised when an exception occurs in the Java code invoked through Py4J.
+You can see the type of exception that was thrown on the Java side and its stack trace, such as the ``java.lang.NullPointerException`` below.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.sparkContext._jvm.java.lang.String(None)
+ Traceback (most recent call last):
+ ...
+ py4j.protocol.Py4JJavaError: An error occurred while calling None.java.lang.String.
+ : java.lang.NullPointerException
+ ...
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.sparkContext._jvm.java.lang.String("x")
+ 'x'
+
+**Py4JError**
+
+``Py4JError`` is raised when any other error occurs, such as when the Python client program tries to access an object that no longer exists on the Java side.
+
+Example:
+
+.. code-block:: python
+
+ >>> from pyspark.ml.linalg import Vectors
+ >>> from pyspark.ml.regression import LinearRegression
+ >>> df = spark.createDataFrame(
+ ...     [(1.0, 2.0, Vectors.dense(1.0)), (0.0, 2.0, Vectors.sparse(1, [], []))],
+ ... ["label", "weight", "features"],
+ ... )
+ >>> lr = LinearRegression(
+ ...     maxIter=1, regParam=0.0, solver="normal", weightCol="weight", fitIntercept=False
+ ... )
+ >>> model = lr.fit(df)
+ >>> model
+ LinearRegressionModel: uid=LinearRegression_eb7bc1d4bf25, numFeatures=1
+ >>> model.__del__()
+ >>> model
+ Traceback (most recent call last):
+ ...
+ py4j.protocol.Py4JError: An error occurred while calling o531.toString. Trace:
+ py4j.Py4JException: Target Object ID does not exist for this gateway :o531
+ ...
+
+Solution:
+
+Access an object that exists on the Java side.
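+
+For instance, continuing the example above, create a new Java-side object instead of reusing the deleted one (re-fitting the model is just one way to do that):
+
+.. code-block:: python
+
+ >>> model = lr.fit(df)  # a fresh model backed by a live Java-side object
+ >>> model.numFeatures
+ 1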
+
+**Py4JNetworkError**
+
+``Py4JNetworkError`` is raised when a problem occurs during network transfer (e.g., connection lost). In this case, debug the network and rebuild the connection.
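+
+A sketch of that pattern (the ``except`` handler below is illustrative; rebuilding the session this way assumes the JVM itself is still alive):
+
+.. code-block:: python
+
+ >>> from py4j.protocol import Py4JNetworkError
+ >>> from pyspark.sql import SparkSession
+ >>> try:
+ ...     spark.range(1).count()
+ ... except Py4JNetworkError:
+ ...     spark = SparkSession.builder.getOrCreate()  # rebuild the connection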
+
+Stack Traces
+------------
+
+There are Spark configurations to control stack traces:
+
+- ``spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled`` is true by default to simplify traceback from Python UDFs.
+
+- ``spark.sql.pyspark.jvmStacktrace.enabled`` is false by default to hide JVM stacktrace and to show a Python-friendly exception only.
+
+The Spark configurations above are independent of log level settings. Control log levels through :meth:`pyspark.SparkContext.setLogLevel`.
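+
+For example, to surface the full JVM stack trace next to the Python traceback (both are SQL configurations and can be changed at runtime):
+
+.. code-block:: python
+
+ >>> spark.conf.set("spark.sql.pyspark.jvmStacktrace.enabled", True)
+ >>> spark.conf.set("spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled", False)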
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]