GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/2556
[SPARK-3478] [PySpark] Profile the Python tasks
This patch adds profiling support for PySpark. The profiling results are
shown before the driver exits; here is one example:
```
============================================================
Profile of RDD<id=3>
============================================================
5146507 function calls (5146487 primitive calls) in 71.094 seconds
Ordered by: internal time, cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
5144576 68.331 0.000 68.331 0.000 statcounter.py:44(merge)
20 2.735 0.137 71.071 3.554 statcounter.py:33(__init__)
20 0.017 0.001 0.017 0.001 {cPickle.dumps}
1024 0.003 0.000 0.003 0.000 t.py:16(<lambda>)
20 0.001 0.000 0.001 0.000 {reduce}
21 0.001 0.000 0.001 0.000 {cPickle.loads}
20 0.001 0.000 0.001 0.000 copy_reg.py:95(_slotnames)
41 0.001 0.000 0.001 0.000 serializers.py:461(read_int)
40 0.001 0.000 0.002 0.000 serializers.py:179(_batched)
62 0.000 0.000 0.000 0.000 {method 'read' of 'file' objects}
20 0.000 0.000 71.072 3.554 rdd.py:863(<lambda>)
20 0.000 0.000 0.001 0.000 serializers.py:198(load_stream)
40/20 0.000 0.000 71.072 3.554 rdd.py:2093(pipeline_func)
41 0.000 0.000 0.002 0.000 serializers.py:130(load_stream)
40 0.000 0.000 71.072 1.777 rdd.py:304(func)
20 0.000 0.000 71.094 3.555 worker.py:82(process)
```
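The table above is a standard `pstats` report, which suggests the worker-side profiling is built on Python's stock `cProfile`/`pstats` machinery. As a minimal standalone sketch of how such a report is produced (the `merge` function below is a toy stand-in for `statcounter.py:merge`, not Spark code):

```python
import cProfile
import io
import pstats

def merge(acc, x):
    # Toy stand-in for statcounter.py merge: accumulate a running sum.
    return acc + x

def run():
    total = 0
    for i in range(10000):
        total = merge(total, i)
    return total

profiler = cProfile.Profile()
result = profiler.runcall(run)

# Sort the same way as the report above: internal time, then cumulative time.
buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf)
stats.sort_stats("time", "cumulative").print_stats(5)
print(buf.getvalue())
```

This prints the familiar `ncalls tottime percall cumtime percall filename:lineno(function)` columns, with `merge` dominating internal time just as in the RDD profile above.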
Also, users can show the profiling results manually with `sc.show_profiles()`,
or dump them to disk with `sc.dump_profiles(path)`, for example:
```python
>>> sc._conf.set("spark.python.profile", "true")
>>> rdd = sc.parallelize(range(100)).map(str)
>>> rdd.count()
100
>>> sc.show_profiles()
============================================================
Profile of RDD<id=1>
============================================================
284 function calls (276 primitive calls) in 0.001 seconds
Ordered by: internal time, cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
4 0.000 0.000 0.000 0.000 serializers.py:198(load_stream)
4 0.000 0.000 0.000 0.000 {reduce}
12/4 0.000 0.000 0.001 0.000 rdd.py:2092(pipeline_func)
4 0.000 0.000 0.000 0.000 {cPickle.loads}
4 0.000 0.000 0.000 0.000 {cPickle.dumps}
104 0.000 0.000 0.000 0.000 rdd.py:852(<genexpr>)
8 0.000 0.000 0.000 0.000 serializers.py:461(read_int)
12 0.000 0.000 0.000 0.000 rdd.py:303(func)
```
Profiling is disabled by default; it can be enabled by setting
"spark.python.profile=true".
Users can also have the results dumped to disk automatically for later
analysis by setting "spark.python.profile.dump=path_to_dump".
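The dump-for-later-analysis workflow can be sketched with the stdlib alone: save raw profiler stats to a file, then reload them with `pstats` offline. The file name below is illustrative only, not necessarily the name Spark uses when dumping per-RDD profiles.

```python
import cProfile
import os
import pstats
import tempfile

def work():
    # Some CPU-bound work to profile.
    return sum(i * i for i in range(5000))

profiler = cProfile.Profile()
profiler.runcall(work)

# Dump the raw stats to disk, analogous to sc.dump_profiles(path),
# then reload them later with pstats for offline analysis.
path = os.path.join(tempfile.mkdtemp(), "rdd_profile.pstats")
profiler.dump_stats(path)

stats = pstats.Stats(path)
stats.sort_stats("cumulative").print_stats(3)
```

A dumped stats file can be inspected at any time with `python -m pstats <file>`, which is convenient for jobs whose drivers have long since exited.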
This is a bugfix of #2351. cc @JoshRosen
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/davies/spark profiler
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2556.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2556
----
commit 4b20494ce4e5e287a09fee5df5e0684711258627
Author: Davies Liu <[email protected]>
Date: 2014-09-11T00:51:28Z
add profile for python
commit 0a5b6ebcd38f13fa15721c56a9d96bd9000529f5
Author: Davies Liu <[email protected]>
Date: 2014-09-11T03:25:23Z
fix Python UDF
commit 4f8309d7d8df18fb5f4da1d9f150d7606bf650c9
Author: Davies Liu <[email protected]>
Date: 2014-09-13T03:14:34Z
address comment, add tests
commit dadee1a228b20d24e4a6b0a7d081f1b30f773988
Author: Davies Liu <[email protected]>
Date: 2014-09-13T04:51:33Z
add docs string and clear profiles after show or dump
commit 15d6f18fd97422ff7bebf343383b7eca9ef433bc
Author: Davies Liu <[email protected]>
Date: 2014-09-13T05:09:06Z
add docs for two configs
commit c23865c6307963f97420d9213d6fb26ab0163f0d
Author: Davies Liu <[email protected]>
Date: 2014-09-13T05:14:19Z
Merge branch 'master' into profiler
commit 09d02c33496598533336a24e0c4ee84e3b6c5317
Author: Davies Liu <[email protected]>
Date: 2014-09-14T04:23:19Z
Merge branch 'master' into profiler
Conflicts:
docs/configuration.md
commit 116d52a1251140282a2cd5c49ad928b219c759b5
Author: Davies Liu <[email protected]>
Date: 2014-09-17T17:14:53Z
Merge branch 'master' of github.com:apache/spark into profiler
Conflicts:
python/pyspark/worker.py
commit fb9565b2afdd7fbaa1cc6cf4b1971fba2d9919b0
Author: Davies Liu <[email protected]>
Date: 2014-09-23T22:16:56Z
Merge branch 'master' of github.com:apache/spark into profiler
Conflicts:
python/pyspark/worker.py
commit cba94639fa6e5c4b2cb26f3152ea80bffaf65cce
Author: Davies Liu <[email protected]>
Date: 2014-09-24T23:05:06Z
move show_profiles and dump_profiles to SparkContext
commit 7a56c2420dd087cbe311d34fa81b5b9d22024b53
Author: Davies Liu <[email protected]>
Date: 2014-09-24T23:12:11Z
bugfix
commit 2b0daf207384b7cbf15a180bb05985fb596e8281
Author: Davies Liu <[email protected]>
Date: 2014-09-24T23:13:25Z
fix docs
commit 7ef2aa05cf07b2648cb73cd05f2ece93a44d9b9a
Author: Davies Liu <[email protected]>
Date: 2014-09-25T21:47:49Z
bugfix, add tests for show_profiles and dump_profiles()
commit 858e74caf5063e43fe7621716bc3e2048321ea00
Author: Davies Liu <[email protected]>
Date: 2014-09-27T04:29:40Z
compatitable with python 2.6
commit e68df5a2ada0044f76d748f4e5dd250a1928812b
Author: Davies Liu <[email protected]>
Date: 2014-09-27T04:30:11Z
Merge branch 'master' of github.com:apache/spark into profiler
----