GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/2556
[SPARK-3478] [PySpark] Profile the Python tasks
This patch adds profiling support for PySpark. The profiling results are
shown before the driver exits; here is one example:
```
============================================================
Profile of RDD<id=3>
============================================================
5146507 function calls (5146487 primitive calls) in 71.094 seconds
Ordered by: internal time, cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
5144576 68.331 0.000 68.331 0.000 statcounter.py:44(merge)
20 2.735 0.137 71.071 3.554 statcounter.py:33(__init__)
20 0.017 0.001 0.017 0.001 {cPickle.dumps}
1024 0.003 0.000 0.003 0.000 t.py:16(<lambda>)
20 0.001 0.000 0.001 0.000 {reduce}
21 0.001 0.000 0.001 0.000 {cPickle.loads}
20 0.001 0.000 0.001 0.000 copy_reg.py:95(_slotnames)
41 0.001 0.000 0.001 0.000 serializers.py:461(read_int)
40 0.001 0.000 0.002 0.000 serializers.py:179(_batched)
62 0.000 0.000 0.000 0.000 {method 'read' of 'file' objects}
20 0.000 0.000 71.072 3.554 rdd.py:863(<lambda>)
20 0.000 0.000 0.001 0.000 serializers.py:198(load_stream)
40/20 0.000 0.000 71.072 3.554 rdd.py:2093(pipeline_func)
41 0.000 0.000 0.002 0.000 serializers.py:130(load_stream)
40 0.000 0.000 71.072 1.777 rdd.py:304(func)
20 0.000 0.000 71.094 3.555 worker.py:82(process)
```
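The table above is a standard `pstats` report, which suggests the worker-side profiling is built on Python's stock `cProfile`/`pstats` machinery. As a minimal standalone sketch of how such a report is produced (the `merge` function below is a toy stand-in for `statcounter.py:merge`, not Spark code):

```python
import cProfile
import io
import pstats

def merge(acc, x):
    # Toy stand-in for statcounter.py merge: accumulate a running sum.
    return acc + x

def run():
    total = 0
    for i in range(10000):
        total = merge(total, i)
    return total

profiler = cProfile.Profile()
result = profiler.runcall(run)

# Sort the same way as the report above: internal time, then cumulative time.
buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf)
stats.sort_stats("time", "cumulative").print_stats(5)
print(buf.getvalue())
```

This prints the familiar `ncalls tottime percall cumtime percall filename:lineno(function)` columns, with `merge` dominating internal time just as in the RDD profile above.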
Also, users can show the profiling results manually with `sc.show_profiles()`,
or dump them to disk with `sc.dump_profiles(path)`, for example:
```python
>>> sc._conf.set("spark.python.profile", "true")
>>> rdd = sc.parallelize(range(100)).map(str)
>>> rdd.count()
100
>>> sc.show_profiles()
============================================================
Profile of RDD<id=1>
============================================================
284 function calls (276 primitive calls) in 0.001 seconds
Ordered by: internal time, cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
4 0.000 0.000 0.000 0.000 serializers.py:198(load_stream)
4 0.000 0.000 0.000 0.000 {reduce}
12/4 0.000 0.000 0.001 0.000 rdd.py:2092(pipeline_func)
4 0.000 0.000 0.000 0.000 {cPickle.loads}
4 0.000 0.000 0.000 0.000 {cPickle.dumps}
104 0.000 0.000 0.000 0.000 rdd.py:852(<genexpr>)
8 0.000 0.000 0.000 0.000 serializers.py:461(read_int)
12 0.000 0.000 0.000 0.000 rdd.py:303(func)
```
Profiling is disabled by default; it can be enabled by setting
"spark.python.profile=true".
Users can also have the results dumped to disk automatically for later
analysis by setting "spark.python.profile.dump=path_to_dump".
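The dump-for-later-analysis workflow can be sketched with the stdlib alone: save raw profiler stats to a file, then reload them with `pstats` offline. The file name below is illustrative only, not necessarily the name Spark uses when dumping per-RDD profiles.

```python
import cProfile
import os
import pstats
import tempfile

def work():
    # Some CPU-bound work to profile.
    return sum(i * i for i in range(5000))

profiler = cProfile.Profile()
profiler.runcall(work)

# Dump the raw stats to disk, analogous to sc.dump_profiles(path),
# then reload them later with pstats for offline analysis.
path = os.path.join(tempfile.mkdtemp(), "rdd_profile.pstats")
profiler.dump_stats(path)

stats = pstats.Stats(path)
stats.sort_stats("cumulative").print_stats(3)
```

A dumped stats file can be inspected at any time with `python -m pstats <file>`, which is convenient for jobs whose drivers have long since exited.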
This is a bugfix of #2351. cc @JoshRosen
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/davies/spark profiler
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2556.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2556
----
commit 4b20494ce4e5e287a09fee5df5e0684711258627
Author: Davies Liu <[email protected]>
Date: 2014-09-11T00:51:28Z
add profile for python
commit 0a5b6ebcd38f13fa15721c56a9d96bd9000529f5
Author: Davies Liu <[email protected]>
Date: 2014-09-11T03:25:23Z
fix Python UDF
commit 4f8309d7d8df18fb5f4da1d9f150d7606bf650c9
Author: Davies Liu <[email protected]>
Date: 2014-09-13T03:14:34Z
address comment, add tests
commit dadee1a228b20d24e4a6b0a7d081f1b30f773988
Author: Davies Liu <[email protected]>
Date: 2014-09-13T04:51:33Z
add docs string and clear profiles after show or dump
commit 15d6f18fd97422ff7bebf343383b7eca9ef433bc
Author: Davies Liu <[email protected]>
Date: 2014-09-13T05:09:06Z
add docs for two configs
commit c23865c6307963f97420d9213d6fb26ab0163f0d
Author: Davies Liu <[email protected]>
Date: 2014-09-13T05:14:19Z
Merge branch 'master' into profiler
commit 09d02c33496598533336a24e0c4ee84e3b6c5317
Author: Davies Liu <[email protected]>
Date: 2014-09-14T04:23:19Z
Merge branch 'master' into profiler
Conflicts:
docs/configuration.md
commit 116d52a1251140282a2cd5c49ad928b219c759b5
Author: Davies Liu <[email protected]>
Date: 2014-09-17T17:14:53Z
Merge branch 'master' of github.com:apache/spark into profiler
Conflicts:
python/pyspark/worker.py
commit fb9565b2afdd7fbaa1cc6cf4b1971fba2d9919b0
Author: Davies Liu <[email protected]>
Date: 2014-09-23T22:16:56Z
Merge branch 'master' of github.com:apache/spark into profiler
Conflicts:
python/pyspark/worker.py
commit cba94639fa6e5c4b2cb26f3152ea80bffaf65cce
Author: Davies Liu <[email protected]>
Date: 2014-09-24T23:05:06Z
move show_profiles and dump_profiles to SparkContext
commit 7a56c2420dd087cbe311d34fa81b5b9d22024b53
Author: Davies Liu <[email protected]>
Date: 2014-09-24T23:12:11Z
bugfix
commit 2b0daf207384b7cbf15a180bb05985fb596e8281
Author: Davies Liu <[email protected]>
Date: 2014-09-24T23:13:25Z
fix docs
commit 7ef2aa05cf07b2648cb73cd05f2ece93a44d9b9a
Author: Davies Liu <[email protected]>
Date: 2014-09-25T21:47:49Z
bugfix, add tests for show_profiles and dump_profiles()
commit 858e74caf5063e43fe7621716bc3e2048321ea00
Author: Davies Liu <[email protected]>
Date: 2014-09-27T04:29:40Z
compatitable with python 2.6
commit e68df5a2ada0044f76d748f4e5dd250a1928812b
Author: Davies Liu <[email protected]>
Date: 2014-09-27T04:30:11Z
Merge branch 'master' of github.com:apache/spark into profiler
----