[ https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15382113#comment-15382113 ]
Maciej Bryński edited comment on SPARK-16321 at 7/18/16 2:12 PM:
-----------------------------------------------------------------

OK, I ran tests with VisualVM and the Python profiler enabled. I set the following options:
{code}
spark.driver.memory 8g
spark.memory.fraction 0.6
# 3 executors
spark.executor.memory 30g
spark.executor.cores 15
# the server has 40 cores, so there is enough CPU for both Scala and Python
spark.executor.extraJavaOptions -XX:NewRatio=4 -XX:+UseParallelOldGC -XX:-UseCompressedOops -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=4048 -Dcom.sun.management.jmxremote.rmi.port=4047 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
{code}
I think those GC options eliminated the GC overhead problem.

*Total run time:*
Spark 2.0: 3.3 min
Spark 1.6: 2.3 min

I attached screenshots from VisualVM from one of the executors (for both the 1.6 and 2.0 versions). As you can see, there are two main problems:
1) PythonRunner time
2) Parquet reader time

PythonRunner time could indicate a problem in the Python code, so I'm pasting the profiles from the Python BasicProfiler. And this is where I don't understand the results:

From the JVM point of view, Spark 2.0 is 50% slower (in the org.apache.spark.api.python.PythonRunner$$anon$1.read() method).
From the Python point of view, Spark 2.0 is slightly faster.

So where could the problem be?
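Side note for anyone reproducing these measurements: the Python-side profiles below were presumably collected with PySpark's built-in BasicProfiler, which is enabled through configuration. A minimal sketch — spark.python.profile and spark.python.profile.dump are standard PySpark properties, but the dump path here is just an example:

```
# spark-defaults.conf (or pass each line via --conf)
spark.python.profile        true
# optional: write per-RDD cProfile stats to files instead of printing on exit
spark.python.profile.dump   /tmp/pyspark-profiles
```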
Spark 1.6 Profiler
{code}
============================================================
Profile of RDD<id=11>
============================================================
3860611667 function calls (3529336059 primitive calls) in 4360.151 seconds

Ordered by: internal time, cumulative time

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
29503  480.019  0.016  3968.411  0.135  {built-in method loads}
1588051  420.956  0.000  420.956  0.000  decoder.py:349(raw_decode)
14602755  362.611  0.000  912.538  0.000  types.py:565(<listcomp>)
59406  319.737  0.005  319.737  0.005  {method 'read' of '_io.BufferedReader' objects}
143753067/43088753  310.714  0.000  1263.782  0.000  types.py:424(fromJson)
689592988  309.769  0.000  549.927  0.000  types.py:437(fromInternal)
143753067  303.711  0.000  366.807  0.000  types.py:394(__init__)
71409609  268.747  0.000  480.577  0.000  types.py:1162(_create_row)
146639250/1588051  181.296  0.000  1376.390  0.001  types.py:736(_parse_datatype_json_value)
723603860  179.262  0.000  179.262  0.000  {built-in method isinstance}
149238722/71409609  146.782  0.000  1547.714  0.000  types.py:558(fromInternal)
71409609  99.052  0.000  99.052  0.000  {built-in method __new__ of type object at 0x9d1c40}
5453542/1588051  95.142  0.000  1288.681  0.001  types.py:527(<listcomp>)
139887576  67.573  0.000  67.573  0.000  {method 'keys' of 'dict' objects}
615371111  64.788  0.000  64.788  0.000  types.py:87(fromInternal)
71409609  64.314  0.000  64.314  0.000  types.py:1280(__setattr__)
400  63.731  0.159  4360.136  10.900  serializers.py:259(dump_stream)
71409609  53.644  0.000  1601.358  0.000  types.py:1159(<lambda>)
120860802  51.438  0.000  115.739  0.000  types.py:466(<genexpr>)
149206609  50.887  0.000  64.772  0.000  types.py:464(<genexpr>)
115407260  50.515  0.000  64.300  0.000  types.py:431(needConversion)
71409609  48.464  0.000  147.516  0.000  types.py:1194(__new__)
1588051  39.248  0.000  1859.901  0.001  types.py:684(_parse_datatype_json_string)
8915977  29.611  0.000  51.700  0.000  types.py:322(<listcomp>)
71409609  27.133  0.000  27.133  0.000  types.py:1158(_create_row_inbound_converter)
5453542/1588051  26.184  0.000  1371.215  0.001  types.py:525(fromJson)
5453542  23.354  0.000  264.314  0.000  types.py:446(__init__)
5453542  22.546  0.000  22.546  0.000  types.py:463(<listcomp>)
2354475  22.168  0.000  22.168  0.000  {built-in method fromtimestamp}
5453542  21.537  0.000  86.309  0.000  {built-in method all}
19161994  19.031  0.000  88.270  0.000  types.py:319(fromInternal)
5453542  16.367  0.000  131.358  0.000  {built-in method any}
19815964  15.244  0.000  29.920  0.000  types.py:174(fromInternal)
19250503  14.890  0.000  17.730  0.000  types.py:311(needConversion)
12500825  14.676  0.000  14.676  0.000  {built-in method fromordinal}
114485251  13.383  0.000  13.383  0.000  types.py:73(needConversion)
2538382  9.649  0.000  40.856  0.000  types.py:194(fromInternal)
2354475  9.039  0.000  9.039  0.000  {method 'replace' of 'datetime.datetime' objects}
4352388  8.644  0.000  8.644  0.000  {method 'match' of '_sre.SRE_Pattern' objects}
1588051  7.681  0.000  436.389  0.000  decoder.py:338(decode)
1588051  5.644  0.000  444.263  0.000  __init__.py:271(loads)
19293583  2.852  0.000  2.852  0.000  types.py:529(needConversion)
1298132  2.834  0.000  413.293  0.000  types.py:306(fromJson)
873033  2.422  0.000  6.821  0.000  <ipython-input-12-1ecd485d1a19>:2(<lambda>)
2461092  2.322  0.000  2.322  0.000  {method 'startswith' of 'str' objects}
873041  1.813  0.000  4.399  0.000  types.py:1267(__getattr__)
1298132  1.751  0.000  2.307  0.000  types.py:283(__init__)
873041  1.671  0.000  1.821  0.000  types.py:1254(__getitem__)
588143  1.171  0.000  1.171  0.000  types.py:216(__init__)
3176102  1.014  0.000  1.014  0.000  {method 'end' of '_sre.SRE_Match' objects}
1176286  0.700  0.000  0.700  0.000  {method 'group' of '_sre.SRE_Match' objects}
29903  0.457  0.000  4289.348  0.143  serializers.py:155(_read_with_length)
1617562  0.447  0.000  0.447  0.000  {built-in method len}
29903  0.371  0.000  300.548  0.010  serializers.py:542(read_int)
873041  0.340  0.000  0.340  0.000  {method 'index' of 'list' objects}
29903  0.236  0.000  4289.584  0.143  serializers.py:136(load_stream)
29503  0.227  0.000  3968.639  0.135  serializers.py:418(loads)
29903  0.130  0.000  0.130  0.000  {built-in method unpack}
407054  0.104  0.000  0.104  0.000  types.py:185(needConversion)
383366  0.093  0.000  0.093  0.000  types.py:167(needConversion)
400  0.006  0.000  4360.150  10.900  worker.py:104(process)
400  0.005  0.000  0.007  0.000  serializers.py:217(load_stream)
400  0.002  0.000  0.002  0.000  rdd.py:303(func)
400  0.001  0.000  0.001  0.000  serializers.py:220(_load_stream_without_unbatching)
800  0.000  0.000  0.000  0.000  {built-in method from_iterable}
4  0.000  0.000  0.000  0.000  {built-in method dumps}
400  0.000  0.000  0.000  0.000  {built-in method iter}
400  0.000  0.000  0.000  0.000  {method 'disable' of '_lsprof.Profiler' objects}
4  0.000  0.000  0.000  0.000  serializers.py:414(dumps)
4  0.000  0.000  0.000  0.000  serializers.py:549(write_int)
8  0.000  0.000  0.000  0.000  {method 'write' of '_io.BufferedWriter' objects}
4  0.000  0.000  0.000  0.000  {built-in method pack}
{code}

Spark 2.0 Profiler
{code}
============================================================
Profile of RDD<id=9>
============================================================
4015274806 function calls (3683999000 primitive calls) in 3824.303 seconds

Ordered by: internal time, cumulative time

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
29503  447.179  0.015  3581.490  0.121  {built-in method loads}
1588069  343.160  0.000  343.160  0.000  decoder.py:349(raw_decode)
14602755  331.047  0.000  837.768  0.000  types.py:600(<listcomp>)
689592988  285.243  0.000  506.721  0.000  types.py:438(fromInternal)
143753265  276.530  0.000  348.146  0.000  types.py:394(__init__)
143753265/43088951  261.310  0.000  1137.452  0.000  types.py:425(fromJson)
71409609  238.990  0.000  430.747  0.000  types.py:1354(_create_row)
867357953  181.193  0.000  181.193  0.000  {built-in method isinstance}
59406  176.652  0.003  176.652  0.003  {method 'read' of '_io.BufferedReader' objects}
146639466/1588069  168.921  0.000  1243.338  0.001  types.py:891(_parse_datatype_json_value)
149238722/71409609  136.504  0.000  1413.990  0.000  types.py:593(fromInternal)
71409609  84.433  0.000  84.433  0.000  {built-in method __new__ of type object at 0x9d1c40}
5453560/1588069  81.277  0.000  1160.041  0.001  types.py:562(<listcomp>)
615371111  61.250  0.000  61.250  0.000  types.py:87(fromInternal)
71409609  61.204  0.000  61.204  0.000  types.py:1483(__setattr__)
400  58.462  0.146  3824.286  9.561  serializers.py:259(dump_stream)
139887774  51.188  0.000  51.188  0.000  {method 'keys' of 'dict' objects}
71409609  51.071  0.000  1465.061  0.000  types.py:1351(<lambda>)
120861018  47.346  0.000  106.776  0.000  types.py:476(<genexpr>)
149206825  46.725  0.000  59.287  0.000  types.py:474(<genexpr>)
115407458  46.443  0.000  59.430  0.000  types.py:432(needConversion)
71409609  46.120  0.000  130.553  0.000  types.py:1400(__new__)
1588069  35.083  0.000  1643.525  0.001  types.py:842(_parse_datatype_json_string)
8915977  26.167  0.000  46.662  0.000  types.py:322(<listcomp>)
71409609  25.725  0.000  25.725  0.000  types.py:1350(_create_row_inbound_converter)
5453560  25.075  0.000  251.842  0.000  types.py:456(__init__)
2354475  21.215  0.000  21.215  0.000  {built-in method fromtimestamp}
5453560/1588069  20.432  0.000  1238.601  0.001  types.py:560(fromJson)
5453560  20.328  0.000  20.328  0.000  types.py:473(<listcomp>)
5453560  19.526  0.000  78.813  0.000  {built-in method all}
19161994  16.877  0.000  78.520  0.000  types.py:319(fromInternal)
5453560  15.870  0.000  121.936  0.000  {built-in method any}
19815964  13.901  0.000  27.726  0.000  types.py:174(fromInternal)
12500825  13.825  0.000  13.825  0.000  {built-in method fromordinal}
114485449  12.640  0.000  12.640  0.000  types.py:73(needConversion)
19250503  12.552  0.000  15.145  0.000  types.py:311(needConversion)
2538382  8.971  0.000  38.776  0.000  types.py:194(fromInternal)
2354475  8.590  0.000  8.590  0.000  {method 'replace' of 'datetime.datetime' objects}
4352424  8.459  0.000  8.459  0.000  {method 'match' of '_sre.SRE_Pattern' objects}
1588069  7.168  0.000  357.679  0.000  decoder.py:338(decode)
1588069  5.234  0.000  365.104  0.000  __init__.py:271(loads)
5453560  3.650  0.000  4.979  0.000  types.py:524(__iter__)
19293583  2.599  0.000  2.599  0.000  types.py:564(needConversion)
1298132  2.476  0.000  373.483  0.000  types.py:306(fromJson)
873033  2.318  0.000  6.430  0.000  <ipython-input-8-1ecd485d1a19>:2(<lambda>)
2461110  2.294  0.000  2.294  0.000  {method 'startswith' of 'str' objects}
873041  1.695  0.000  4.112  0.000  types.py:1470(__getattr__)
1298132  1.655  0.000  2.191  0.000  types.py:283(__init__)
873041  1.534  0.000  1.678  0.000  types.py:1457(__getitem__)
5453960  1.329  0.000  1.329  0.000  {built-in method iter}
588143  1.073  0.000  1.073  0.000  types.py:216(__init__)
3176138  0.943  0.000  0.943  0.000  {method 'end' of '_sre.SRE_Match' objects}
1176286  0.673  0.000  0.673  0.000  {method 'group' of '_sre.SRE_Match' objects}
1617580  0.420  0.000  0.420  0.000  {built-in method len}
29903  0.398  0.000  3759.130  0.126  serializers.py:155(_read_with_length)
873041  0.327  0.000  0.327  0.000  {method 'index' of 'list' objects}
29903  0.274  0.000  157.542  0.005  serializers.py:542(read_int)
29903  0.263  0.000  3759.393  0.126  serializers.py:136(load_stream)
29503  0.198  0.000  3581.688  0.121  serializers.py:418(loads)
29903  0.104  0.000  0.104  0.000  {built-in method unpack}
407054  0.097  0.000  0.097  0.000  types.py:185(needConversion)
383366  0.080  0.000  0.080  0.000  types.py:167(needConversion)
400  0.007  0.000  3824.302  9.561  worker.py:165(process)
400  0.006  0.000  0.008  0.000  serializers.py:217(load_stream)
400  0.002  0.000  0.002  0.000  rdd.py:303(func)
400  0.001  0.000  0.001  0.000  serializers.py:220(_load_stream_without_unbatching)
800  0.001  0.000  0.001  0.000  {built-in method from_iterable}
400  0.000  0.000  0.000  0.000  {method 'disable' of '_lsprof.Profiler' objects}
4  0.000  0.000  0.000  0.000  {built-in method dumps}
4  0.000  0.000  0.000  0.000  serializers.py:414(dumps)
4  0.000  0.000  0.000  0.000  serializers.py:549(write_int)
8  0.000  0.000  0.000  0.000  {method 'write' of '_io.BufferedWriter' objects}
4  0.000  0.000  0.000  0.000  {built-in method pack}
{code}

was (Author: maver1ck): (The earlier revision was identical to the comment above, with the same profiler tables, except that it ended with an additional question:) Or maybe I'm wrong, and this method only measures time spent reading data from Python, not Python code execution?

> Pyspark 2.0 performance drop vs pyspark 1.6
> -------------------------------------------
>
>                 Key: SPARK-16321
>                 URL: https://issues.apache.org/jira/browse/SPARK-16321
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Maciej Bryński
>         Attachments: visualvm_spark16.png, visualvm_spark2.png, visualvm_spark2_G1GC.png
>
> I did some tests on a parquet file with many nested columns (about 30G in 400 partitions), and Spark 2.0 is 2x slower.
> {code}
> df = sqlctx.read.parquet(path)
> df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id % 100000 else []).collect()
> {code}
> Spark 1.6 -> 2.3 min
> Spark 2.0 -> 4.6 min (2x slower)
> I used BasicProfiler for this task, and the cumulative time was:
> Spark 1.6 - 4300 sec
> Spark 2.0 - 5800 sec
> Should I expect such a drop in performance?
> I don't know how to prepare sample data to show the problem. Any ideas? Or public data with many nested columns?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
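As a closing sanity check on the numbers reported in this thread (plain Python arithmetic; the timings are copied from the description and the attached BasicProfiler tables, not re-measured):

```python
# Wall-clock run times from the issue description, in minutes.
wall_16, wall_20 = 2.3, 4.6
print(f"Wall clock: Spark 2.0 is {wall_20 / wall_16:.1f}x slower")

# Cumulative Python-profiler time from the attached tables, in seconds.
py_16, py_20 = 4360.151, 3824.303
print(f"Python side: Spark 2.0 is {(py_16 - py_20) / py_16:.0%} faster")

# The Python side got slightly faster while the job as a whole got slower,
# which is consistent with the regression being on the JVM side
# (PythonRunner.read() / Parquet reader), as the comment suggests.
```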