GitHub user kxepal opened a pull request:
https://github.com/apache/spark/pull/17671
[SPARK-20368][PYSPARK] Provide optional support for Sentry on PySpark
workers
## What changes were proposed in this pull request?
### Rationale
PySpark allows Python functions to be used as UDFs and in common
transformations like `map` or `filter`. Unfortunately, code may
contain bugs that lead to exceptions. Some Python exceptions are
quite easy to understand and fix; others require understanding the
overall function context. For instance:
```
TypeError: 'NoneType' object is not subscriptable
```
OK, so somewhere we tried to access a `None` value by index or key,
but why did this value become `None`? That was not the plan. To
understand why, and to reproduce the problem, you would want to see how
this function was called and what the state of all its locals was.
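As a minimal illustration (hypothetical function and names, not code from this patch), the raw exception message carries none of that context:

```python
def lookup(record):
    # record unexpectedly arrives as None from an upstream transformation
    return record['name']

try:
    lookup(None)
except TypeError as exc:
    # The message alone does not tell you which value was None, or why:
    # "'NoneType' object is not subscriptable"
    print(exc)
```

To debug this, you need the call arguments and the frame's locals, which is exactly what an error-tracking system can capture.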
Sentry is one of the systems that capture, store, and classify
tracebacks, making it easy to understand what went wrong, and it is
quite popular among Python developers.
However, a project-wide Sentry configuration cannot be applied to these
functions, since they are executed remotely, outside the project
context. So either every function must carry its own capture handler,
or the PySpark worker has to take care of everything.
### Motivation
We have this patch applied locally, and I'd like to propose it for
upstream. Currently, we have to patch PySpark for every release.
Moreover, we cannot just patch a single file, since we also have to
ensure that the patch gets into the pyspark.zip archive that is
deployed to the executors. Unfortunately, no way was found to plug
into the PySpark worker without patching it.
### Known concerns
1. This adds support for one of many bug tracking systems. That's true.
The reason for Sentry is that it is a very popular system among Python
developers, and most of them are familiar with it. I personally haven't
heard of other ones used by Python developers, but if many of them want
PySpark support, we could develop a more pluggable solution.
### Possible alternatives
You can wrap ALL your functions that will be executed remotely on
executors with a decorator that provides the same Sentry support or
raises a much more verbose traceback, extracting locals via the
`inspect` module. This was found to be very inconvenient, since you
always have to wrap all your functions, and it is easy to forget to do so.
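For illustration, such a decorator (a hypothetical sketch, not code from this patch) could re-raise with the failing frame's locals attached to the message:

```python
import functools
import sys

def capture_locals(fn):
    # Hypothetical decorator: re-raise with the innermost frame's locals
    # appended to the message, so the driver-side traceback shows them.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            tb = sys.exc_info()[2]
            # Walk down to the innermost frame, where the error was raised.
            while tb.tb_next is not None:
                tb = tb.tb_next
            frame_locals = tb.tb_frame.f_locals
            raise type(exc)('%s; locals: %r' % (exc, frame_locals))
    return wrapper
```

The inconvenience is exactly that every UDF and every function passed to `map` or `filter` would have to be wrapped this way by hand.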
### How to use
1. You need to have the Sentry client (called raven) available on the
executors. It may be installed there via the system package manager or
shipped via `sc.addPyFile` as an egg.
2. Pass the Sentry DSN via the Spark configuration as an executor
environment variable:
```
spark.conf.set('spark.executorEnv.SENTRY_DSN', '__DSN__')
```
Additionally, you can configure the project release, environment, tags,
and other bits via Sentry's environment variables:
- SENTRY_ENVIRONMENT - Optional, the environment your application is
running in, like `production`
- SENTRY_EXTRA_TAGS - Optional, tag names to be extracted from the
MDC, like `foo,bar,baz`
- SENTRY_RELEASE - Optional, the release version of your
application, like `1.0.0`
- SENTRY_TAGS - Optional, tags like `tag1:value1,tag2:value2`
3. Follow the rest of the Sentry documentation on how to use Sentry if
you're not familiar with it.
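Conceptually, the worker-side behaviour these settings enable looks roughly like the following sketch (simplified and not the actual patch code; `raven.Client` is the real client class, while the wrapper name is made up):

```python
import os

def run_with_sentry(fn, *args):
    # Simplified sketch: if SENTRY_DSN is set on the executor, report any
    # exception from the user function to Sentry, then re-raise so Spark
    # still sees the failure as usual.
    dsn = os.environ.get('SENTRY_DSN')
    client = None
    if dsn:
        from raven import Client  # optional dependency, see step 1
        client = Client(dsn)
    try:
        return fn(*args)
    except Exception:
        if client is not None:
            client.captureException()
        raise
```

When `SENTRY_DSN` is not set, the wrapper does nothing beyond the plain re-raise, which is what keeps the support optional.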
## How was this patch tested?
This patch was tested manually on local infrastructure.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kxepal/spark
20368-sentry-support-on-pyspark-workers
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17671.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17671
----
commit 8e9206f2a1c34847efe943afe51b5bdde7298914
Author: Alexander Shorin <[email protected]>
Date: 2017-04-17T13:25:39Z
Provide optional support for Sentry on PySpark workers
SPARK-20368
----