GitHub user kxepal opened a pull request:
https://github.com/apache/spark/pull/17671
[SPARK-20368][PYSPARK] Provide optional support for Sentry on PySpark
workers
## What changes were proposed in this pull request?
### Rationale
PySpark allows Python functions to be used as UDFs and in common
transformations like `map` or `filter`. Unfortunately, code may
contain bugs that lead to exceptions. Some Python exceptions are
quite easy to understand and fix; others require understanding the
overall function context. For instance:
```
TypeError: 'NoneType' object is not subscriptable
```
OK, so somewhere we tried to access a `None` value by index or key,
but why did this value become `None`? That was not the plan. To
understand why, and to reproduce the problem, you would want to see how
this function was called and what the state of all its locals was.
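As a minimal illustration (hypothetical function and names, not code from this patch), the raw exception message carries none of that context:

```python
def lookup(record):
    # record unexpectedly arrives as None from an upstream transformation
    return record['name']

try:
    lookup(None)
except TypeError as exc:
    # The message alone does not tell you which value was None, or why:
    # "'NoneType' object is not subscriptable"
    print(exc)
```

To debug this, you need the call arguments and the frame's locals, which is exactly what an error-tracking system can capture.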
Sentry is one of the systems that capture, store, and classify
tracebacks, making it easy to understand what went wrong, and it is
quite popular among Python developers.
However, a project-wide Sentry configuration cannot be applied to these
functions, since they are executed remotely, outside the project
context. So either every function must carry its own capture handler,
or the PySpark worker has to take care of everything.
### Motivation
We have this patch applied locally, and I'd like to propose it for
upstream. Currently, we have to patch PySpark for every release.
Moreover, we cannot just patch a single file, since we also have to
ensure that the patch gets into the pyspark.zip archive that is
deployed to the executors. Unfortunately, no way was found to plug
into the PySpark worker without patching it.
### Known concerns
1. This adds support for one of many bug tracking systems. That's true.
The reason for Sentry is that it is a very popular system among Python
developers, and most of them are familiar with it. I personally haven't
heard of other ones used by Python developers, but if many of them want
PySpark support, we could develop a more pluggable solution.
### Possible alternatives
You can wrap ALL your functions that will be executed remotely on
executors with a decorator that provides the same Sentry support or
raises a much more verbose traceback, extracting locals via the
`inspect` module. This was found to be very inconvenient, since you
always have to wrap all your functions, and it is easy to forget to do so.
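For illustration, such a decorator (a hypothetical sketch, not code from this patch) could re-raise with the failing frame's locals attached to the message:

```python
import functools
import sys

def capture_locals(fn):
    # Hypothetical decorator: re-raise with the innermost frame's locals
    # appended to the message, so the driver-side traceback shows them.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            tb = sys.exc_info()[2]
            # Walk down to the innermost frame, where the error was raised.
            while tb.tb_next is not None:
                tb = tb.tb_next
            frame_locals = tb.tb_frame.f_locals
            raise type(exc)('%s; locals: %r' % (exc, frame_locals))
    return wrapper
```

The inconvenience is exactly that every UDF and every function passed to `map` or `filter` would have to be wrapped this way by hand.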
### How to use
1. You need to have the Sentry client (called raven) available on the
executors. It may be installed there via the system package manager or
shipped via `sc.addPyFile` as an egg.
2. Pass the Sentry DSN via the Spark configuration as an executor
environment variable:
```
spark.conf.set('spark.executorEnv.SENTRY_DSN', '__DSN__')
```
Additionally, you can configure the project release, environment, tags,
and other bits via Sentry's environment variables:
- SENTRY_ENVIRONMENT - Optional, the environment your application is
running in, like `production`
- SENTRY_EXTRA_TAGS - Optional, tag names to be extracted from the
MDC, like `foo,bar,baz`
- SENTRY_RELEASE - Optional, the release version of your
application, like `1.0.0`
- SENTRY_TAGS - Optional, tags like `tag1:value1,tag2:value2`
3. Follow the rest of the Sentry documentation on how to use Sentry if
you're not familiar with it.
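Conceptually, the worker-side behaviour these settings enable looks roughly like the following sketch (simplified and not the actual patch code; `raven.Client` is the real client class, while the wrapper name is made up):

```python
import os

def run_with_sentry(fn, *args):
    # Simplified sketch: if SENTRY_DSN is set on the executor, report any
    # exception from the user function to Sentry, then re-raise so Spark
    # still sees the failure as usual.
    dsn = os.environ.get('SENTRY_DSN')
    client = None
    if dsn:
        from raven import Client  # optional dependency, see step 1
        client = Client(dsn)
    try:
        return fn(*args)
    except Exception:
        if client is not None:
            client.captureException()
        raise
```

When `SENTRY_DSN` is not set, the wrapper does nothing beyond the plain re-raise, which is what keeps the support optional.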
## How was this patch tested?
This patch was tested manually on local infrastructure.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kxepal/spark
20368-sentry-support-on-pyspark-workers
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17671.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17671
----
commit 8e9206f2a1c34847efe943afe51b5bdde7298914
Author: Alexander Shorin <[email protected]>
Date: 2017-04-17T13:25:39Z
Provide optional support for Sentry on PySpark workers
SPARK-20368
----