This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.3 by this push:
new 2672624931d [SPARK-38953][PYTHON][DOC] Document PySpark common exceptions / errors
2672624931d is described below
commit 2672624931dd4784fad6cdd912e3669c83741060
Author: Xinrong Meng <[email protected]>
AuthorDate: Sun May 15 09:25:02 2022 +0900
[SPARK-38953][PYTHON][DOC] Document PySpark common exceptions / errors
### What changes were proposed in this pull request?
Document common exceptions/errors in PySpark (SQL, pandas API on Spark, and Py4J) and their respective solutions.
### Why are the changes needed?
Make PySpark debugging easier.
There are common exceptions/errors in PySpark SQL, pandas API on Spark, and Py4J.
Documenting these exceptions and their respective solutions helps users debug PySpark.
### Does this PR introduce _any_ user-facing change?
No. Document change only.
### How was this patch tested?
Manual test.
<img width="1019" alt="image" src="https://user-images.githubusercontent.com/47337188/165145874-b0de33b1-835a-459d-9062-94086e62e254.png">
Please refer to
https://github.com/apache/spark/blob/7a1c7599a21cbbe2778707b72643cf98ac601ab1/python/docs/source/development/debugging.rst#common-exceptions--errors
for the whole rendered page.
Closes #36267 from xinrong-databricks/common_err.
Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit f940d7adfd6d071bc3bdcc406e01263a7f03e955)
Signed-off-by: Hyukjin Kwon <[email protected]>
---
python/docs/source/development/debugging.rst | 280 +++++++++++++++++++++++++++
1 file changed, 280 insertions(+)
diff --git a/python/docs/source/development/debugging.rst b/python/docs/source/development/debugging.rst
index 1e6571da028..05c47ae4bf7 100644
--- a/python/docs/source/development/debugging.rst
+++ b/python/docs/source/development/debugging.rst
@@ -332,3 +332,283 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``Ar
This feature is not supported with registered UDFs.
+
+Common Exceptions / Errors
+--------------------------
+
+PySpark SQL
+~~~~~~~~~~~
+
+**AnalysisException**
+
+``AnalysisException`` is raised when failing to analyze a SQL query plan.
+
+Example:
+
+.. code-block:: python
+
+ >>> df = spark.range(1)
+ >>> df['bad_key']
+ Traceback (most recent call last):
+ ...
+ pyspark.sql.utils.AnalysisException: Cannot resolve column name "bad_key" among (id)
+
+Solution:
+
+.. code-block:: python
+
+ >>> df['id']
+ Column<'id'>
+
+**ParseException**
+
+``ParseException`` is raised when failing to parse a SQL command.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.sql("select * 1")
+ Traceback (most recent call last):
+ ...
+ pyspark.sql.utils.ParseException:
+ Syntax error at or near '1': extra input '1'(line 1, pos 9)
+ == SQL ==
+ select * 1
+ ---------^^^
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.sql("select *")
+ DataFrame[]
+
+**IllegalArgumentException**
+
+``IllegalArgumentException`` is raised when passing an illegal or inappropriate argument.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(-1.0)
+ Traceback (most recent call last):
+ ...
+ pyspark.sql.utils.IllegalArgumentException: requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.range(1).sample(1.0)
+ DataFrame[id: bigint]
+
+**PythonException**
+
+``PythonException`` is thrown from Python workers.
+
+You can see the type of exception that was thrown from the Python worker and its stack trace, such as the ``TypeError`` below.
+
+Example:
+
+.. code-block:: python
+
+ >>> import pyspark.sql.functions as F
+ >>> from pyspark.sql.functions import udf
+ >>> def f(x):
+ ... return F.abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ 22/04/12 14:52:31 ERROR Executor: Exception in task 7.0 in stage 37.0 (TID 232)
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ ...
+ TypeError: Invalid argument, not a string or column: -1 of type <class 'int'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
+
+Solution:
+
+.. code-block:: python
+
+ >>> def f(x):
+ ... return abs(x)
+ ...
+ >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+ [Row(id=-1, abs='1'), Row(id=0, abs='0')]
+
+**StreamingQueryException**
+
+``StreamingQueryException`` is raised when a StreamingQuery fails. Most often, the underlying error is raised in a Python worker and wrapped as a ``PythonException``.
+
+Example:
+
+.. code-block:: python
+
+ >>> sdf = spark.readStream.format("text").load("python/test_support/sql/streaming")
+ >>> from pyspark.sql.functions import col, udf
+ >>> bad_udf = udf(lambda x: 1 / 0)
+ >>> (sdf.select(bad_udf(col("value"))).writeStream.format("memory").queryName("q1").start()).processAllAvailable()
+ Traceback (most recent call last):
+ ...
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ File "<stdin>", line 1, in <lambda>
+ ZeroDivisionError: division by zero
+ ...
+ pyspark.sql.utils.StreamingQueryException: Query q1 [id = ced5797c-74e2-4079-825b-f3316b327c7d, runId = 65bacaf3-9d51-476a-80ce-0ac388d4906a] terminated with exception: Writing job aborted
+
+Solution:
+
+Fix the StreamingQuery and re-execute the workflow.
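+
+A minimal sketch of that fix, reusing ``sdf`` from the example above (the guarded UDF and the query name ``q2`` are illustrative, not part of PySpark):
+
+.. code-block:: python
+
+ >>> from pyspark.sql.functions import col, udf
+ >>> good_udf = udf(lambda x: 1 / len(x) if x else 0.0)  # guard against empty input
+ >>> q = sdf.select(good_udf(col("value"))).writeStream.format("memory").queryName("q2").start()
+ >>> q.processAllAvailable()
+ >>> q.stop()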
+
+**SparkUpgradeException**
+
+``SparkUpgradeException`` is thrown because of a Spark upgrade: usage that worked on an earlier Spark version is no longer accepted after the upgrade.
+
+Example:
+
+.. code-block:: python
+
+ >>> from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime
+ >>> df = spark.createDataFrame([("2014-31-12",)], ["date_str"])
+ >>> df2 = df.select("date_str", to_date(from_unixtime(unix_timestamp("date_str", "yyyy-dd-aa"))))
+ >>> df2.collect()
+ Traceback (most recent call last):
+ ...
+ pyspark.sql.utils.SparkUpgradeException: You may get a different result due to the upgrading to Spark >= 3.0: Fail to recognize 'yyyy-dd-aa' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
+ >>> df2 = df.select("date_str", to_date(from_unixtime(unix_timestamp("date_str", "yyyy-dd-aa"))))
+ >>> df2.collect()
+ [Row(date_str='2014-31-12', to_date(from_unixtime(unix_timestamp(date_str, yyyy-dd-aa), yyyy-MM-dd HH:mm:ss))=None)]
+
+pandas API on Spark
+~~~~~~~~~~~~~~~~~~~
+
+The following exceptions / errors are specific to pandas API on Spark.
+
+**ValueError: Cannot combine the series or dataframe because it comes from a different dataframe**
+
+Operations involving more than one Series or DataFrame raise a ``ValueError`` if ``compute.ops_on_diff_frames`` is disabled (it is disabled by default). Such operations may be expensive because they join the underlying Spark DataFrames, so be aware of the cost and enable the flag only when necessary.
+
+Exception:
+
+.. code-block:: python
+
+ >>> import pyspark.pandas as ps
+ >>> ps.Series([1, 2]) + ps.Series([3, 4])
+ Traceback (most recent call last):
+ ...
+ ValueError: Cannot combine the series or dataframe because it comes from a different dataframe. In order to allow this operation, enable 'compute.ops_on_diff_frames' option.
+
+
+Solution:
+
+.. code-block:: python
+
+ >>> with ps.option_context('compute.ops_on_diff_frames', True):
+ ... ps.Series([1, 2]) + ps.Series([3, 4])
+ ...
+ 0 4
+ 1 6
+ dtype: int64
+
+**RuntimeError: Result vector from pandas_udf was not the required length**
+
+Exception:
+
+.. code-block:: python
+
+ >>> import numpy as np
+ >>> import pyspark.pandas as ps
+ >>> def f(x) -> ps.Series[np.int32]:
+ ... return x[:-1]
+ ...
+ >>> ps.DataFrame({"x":[1, 2], "y":[3, 4]}).transform(f)
+ 22/04/12 13:46:39 ERROR Executor: Exception in task 2.0 in stage 16.0 (TID 88)
+ org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+ ...
+ RuntimeError: Result vector from pandas_udf was not the required length: expected 1, got 0
+
+Solution:
+
+.. code-block:: python
+
+ >>> def f(x) -> ps.Series[np.int32]:
+ ... return x
+ ...
+ >>> ps.DataFrame({"x":[1, 2], "y":[3, 4]}).transform(f)
+ x y
+ 0 1 3
+ 1 2 4
+
+Py4J
+~~~~
+
+**Py4JJavaError**
+
+``Py4JJavaError`` is raised when an exception occurs in the Java code invoked through Py4J.
+You can see the type of exception that was thrown on the Java side and its stack trace, such as the ``java.lang.NullPointerException`` below.
+
+Example:
+
+.. code-block:: python
+
+ >>> spark.sparkContext._jvm.java.lang.String(None)
+ Traceback (most recent call last):
+ ...
+ py4j.protocol.Py4JJavaError: An error occurred while calling None.java.lang.String.
+ : java.lang.NullPointerException
+ ...
+
+Solution:
+
+.. code-block:: python
+
+ >>> spark.sparkContext._jvm.java.lang.String("x")
+ 'x'
+
+**Py4JError**
+
+``Py4JError`` is raised when any other error occurs, such as when the Python client program tries to access an object that no longer exists on the Java side.
+
+Example:
+
+.. code-block:: python
+
+ >>> from pyspark.ml.linalg import Vectors
+ >>> from pyspark.ml.regression import LinearRegression
+ >>> df = spark.createDataFrame(
+ ...     [(1.0, 2.0, Vectors.dense(1.0)), (0.0, 2.0, Vectors.sparse(1, [], []))],
+ ... ["label", "weight", "features"],
+ ... )
+ >>> lr = LinearRegression(
+ ...     maxIter=1, regParam=0.0, solver="normal", weightCol="weight", fitIntercept=False
+ ... )
+ >>> model = lr.fit(df)
+ >>> model
+ LinearRegressionModel: uid=LinearRegression_eb7bc1d4bf25, numFeatures=1
+ >>> model.__del__()
+ >>> model
+ Traceback (most recent call last):
+ ...
+ py4j.protocol.Py4JError: An error occurred while calling o531.toString. Trace:
+ py4j.Py4JException: Target Object ID does not exist for this gateway :o531
+ ...
+
+Solution:
+
+Access an object that exists on the Java side.
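+
+For instance, continuing the example above, create a new Java-side object instead of reusing the deleted one (re-fitting the model is just one way to do that):
+
+.. code-block:: python
+
+ >>> model = lr.fit(df)  # a fresh model backed by a live Java-side object
+ >>> model.numFeatures
+ 1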
+
+**Py4JNetworkError**
+
+``Py4JNetworkError`` is raised when a problem occurs during network transfer (e.g., connection lost). In this case, debug the network and rebuild the connection.
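+
+A sketch of that pattern (the ``except`` handler below is illustrative; rebuilding the session this way assumes the JVM itself is still alive):
+
+.. code-block:: python
+
+ >>> from py4j.protocol import Py4JNetworkError
+ >>> from pyspark.sql import SparkSession
+ >>> try:
+ ...     spark.range(1).count()
+ ... except Py4JNetworkError:
+ ...     spark = SparkSession.builder.getOrCreate()  # rebuild the connection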
+
+Stack Traces
+------------
+
+There are Spark configurations to control stack traces:
+
+- ``spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled`` is true by default to simplify traceback from Python UDFs.
+
+- ``spark.sql.pyspark.jvmStacktrace.enabled`` is false by default to hide JVM stacktrace and to show a Python-friendly exception only.
+
+The Spark configurations above are independent of log level settings. Control log levels through :meth:`pyspark.SparkContext.setLogLevel`.
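+
+For example, to surface the full JVM stack trace next to the Python traceback (both are SQL configurations and can be changed at runtime):
+
+.. code-block:: python
+
+ >>> spark.conf.set("spark.sql.pyspark.jvmStacktrace.enabled", True)
+ >>> spark.conf.set("spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled", False)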
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]