[ https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15382113#comment-15382113 ]
Maciej Bryński edited comment on SPARK-16321 at 7/18/16 2:12 PM:
-----------------------------------------------------------------

OK, I ran tests with VisualVM and the Python profiler enabled. I set the following options:
{code}
spark.driver.memory 8g
spark.memory.fraction 0.6
# 3 executors
spark.executor.memory 30g
spark.executor.cores 15
# the server has 40 cores, so there is enough CPU for both Scala and Python
spark.executor.extraJavaOptions -XX:NewRatio=4 -XX:+UseParallelOldGC -XX:-UseCompressedOops -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=4048 -Dcom.sun.management.jmxremote.rmi.port=4047 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
{code}
I think those GC options eliminated the GC overhead problem.

*Total run time:*
Spark 2.0: 3.3 min
Spark 1.6: 2.3 min

I attached screenshots from VisualVM from one of the executors (for both the 1.6 and 2.0 versions). As you can see, there are two main problems:
1) PythonRunner time
2) Parquet reader time

PythonRunner time could indicate a problem in the Python code, so I'm pasting the profiles from the Python BasicProfiler. And this is where I don't understand the results:

From the JVM point of view, Spark 2.0 is 50% slower (in the org.apache.spark.api.python.PythonRunner$$anon$1.read() method).
From the Python point of view, Spark 2.0 is slightly faster.

So where could the problem be?
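Side note for anyone reproducing these measurements: the Python-side profiles below were presumably collected with PySpark's built-in BasicProfiler, which is enabled through configuration. A minimal sketch — spark.python.profile and spark.python.profile.dump are standard PySpark properties, but the dump path here is just an example:

```
# spark-defaults.conf (or pass each line via --conf)
spark.python.profile        true
# optional: write per-RDD cProfile stats to files instead of printing on exit
spark.python.profile.dump   /tmp/pyspark-profiles
```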
Spark 1.6 Profiler
{code}
============================================================
Profile of RDD<id=11>
============================================================
3860611667 function calls (3529336059 primitive calls) in 4360.151 seconds

Ordered by: internal time, cumulative time

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
29503  480.019  0.016  3968.411  0.135  {built-in method loads}
1588051  420.956  0.000  420.956  0.000  decoder.py:349(raw_decode)
14602755  362.611  0.000  912.538  0.000  types.py:565(<listcomp>)
59406  319.737  0.005  319.737  0.005  {method 'read' of '_io.BufferedReader' objects}
143753067/43088753  310.714  0.000  1263.782  0.000  types.py:424(fromJson)
689592988  309.769  0.000  549.927  0.000  types.py:437(fromInternal)
143753067  303.711  0.000  366.807  0.000  types.py:394(__init__)
71409609  268.747  0.000  480.577  0.000  types.py:1162(_create_row)
146639250/1588051  181.296  0.000  1376.390  0.001  types.py:736(_parse_datatype_json_value)
723603860  179.262  0.000  179.262  0.000  {built-in method isinstance}
149238722/71409609  146.782  0.000  1547.714  0.000  types.py:558(fromInternal)
71409609  99.052  0.000  99.052  0.000  {built-in method __new__ of type object at 0x9d1c40}
5453542/1588051  95.142  0.000  1288.681  0.001  types.py:527(<listcomp>)
139887576  67.573  0.000  67.573  0.000  {method 'keys' of 'dict' objects}
615371111  64.788  0.000  64.788  0.000  types.py:87(fromInternal)
71409609  64.314  0.000  64.314  0.000  types.py:1280(__setattr__)
400  63.731  0.159  4360.136  10.900  serializers.py:259(dump_stream)
71409609  53.644  0.000  1601.358  0.000  types.py:1159(<lambda>)
120860802  51.438  0.000  115.739  0.000  types.py:466(<genexpr>)
149206609  50.887  0.000  64.772  0.000  types.py:464(<genexpr>)
115407260  50.515  0.000  64.300  0.000  types.py:431(needConversion)
71409609  48.464  0.000  147.516  0.000  types.py:1194(__new__)
1588051  39.248  0.000  1859.901  0.001  types.py:684(_parse_datatype_json_string)
8915977  29.611  0.000  51.700  0.000  types.py:322(<listcomp>)
71409609  27.133  0.000  27.133  0.000  types.py:1158(_create_row_inbound_converter)
5453542/1588051  26.184  0.000  1371.215  0.001  types.py:525(fromJson)
5453542  23.354  0.000  264.314  0.000  types.py:446(__init__)
5453542  22.546  0.000  22.546  0.000  types.py:463(<listcomp>)
2354475  22.168  0.000  22.168  0.000  {built-in method fromtimestamp}
5453542  21.537  0.000  86.309  0.000  {built-in method all}
19161994  19.031  0.000  88.270  0.000  types.py:319(fromInternal)
5453542  16.367  0.000  131.358  0.000  {built-in method any}
19815964  15.244  0.000  29.920  0.000  types.py:174(fromInternal)
19250503  14.890  0.000  17.730  0.000  types.py:311(needConversion)
12500825  14.676  0.000  14.676  0.000  {built-in method fromordinal}
114485251  13.383  0.000  13.383  0.000  types.py:73(needConversion)
2538382  9.649  0.000  40.856  0.000  types.py:194(fromInternal)
2354475  9.039  0.000  9.039  0.000  {method 'replace' of 'datetime.datetime' objects}
4352388  8.644  0.000  8.644  0.000  {method 'match' of '_sre.SRE_Pattern' objects}
1588051  7.681  0.000  436.389  0.000  decoder.py:338(decode)
1588051  5.644  0.000  444.263  0.000  __init__.py:271(loads)
19293583  2.852  0.000  2.852  0.000  types.py:529(needConversion)
1298132  2.834  0.000  413.293  0.000  types.py:306(fromJson)
873033  2.422  0.000  6.821  0.000  <ipython-input-12-1ecd485d1a19>:2(<lambda>)
2461092  2.322  0.000  2.322  0.000  {method 'startswith' of 'str' objects}
873041  1.813  0.000  4.399  0.000  types.py:1267(__getattr__)
1298132  1.751  0.000  2.307  0.000  types.py:283(__init__)
873041  1.671  0.000  1.821  0.000  types.py:1254(__getitem__)
588143  1.171  0.000  1.171  0.000  types.py:216(__init__)
3176102  1.014  0.000  1.014  0.000  {method 'end' of '_sre.SRE_Match' objects}
1176286  0.700  0.000  0.700  0.000  {method 'group' of '_sre.SRE_Match' objects}
29903  0.457  0.000  4289.348  0.143  serializers.py:155(_read_with_length)
1617562  0.447  0.000  0.447  0.000  {built-in method len}
29903  0.371  0.000  300.548  0.010  serializers.py:542(read_int)
873041  0.340  0.000  0.340  0.000  {method 'index' of 'list' objects}
29903  0.236  0.000  4289.584  0.143  serializers.py:136(load_stream)
29503  0.227  0.000  3968.639  0.135  serializers.py:418(loads)
29903  0.130  0.000  0.130  0.000  {built-in method unpack}
407054  0.104  0.000  0.104  0.000  types.py:185(needConversion)
383366  0.093  0.000  0.093  0.000  types.py:167(needConversion)
400  0.006  0.000  4360.150  10.900  worker.py:104(process)
400  0.005  0.000  0.007  0.000  serializers.py:217(load_stream)
400  0.002  0.000  0.002  0.000  rdd.py:303(func)
400  0.001  0.000  0.001  0.000  serializers.py:220(_load_stream_without_unbatching)
800  0.000  0.000  0.000  0.000  {built-in method from_iterable}
4  0.000  0.000  0.000  0.000  {built-in method dumps}
400  0.000  0.000  0.000  0.000  {built-in method iter}
400  0.000  0.000  0.000  0.000  {method 'disable' of '_lsprof.Profiler' objects}
4  0.000  0.000  0.000  0.000  serializers.py:414(dumps)
4  0.000  0.000  0.000  0.000  serializers.py:549(write_int)
8  0.000  0.000  0.000  0.000  {method 'write' of '_io.BufferedWriter' objects}
4  0.000  0.000  0.000  0.000  {built-in method pack}
{code}

Spark 2.0 Profiler
{code}
============================================================
Profile of RDD<id=9>
============================================================
4015274806 function calls (3683999000 primitive calls) in 3824.303 seconds

Ordered by: internal time, cumulative time

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
29503  447.179  0.015  3581.490  0.121  {built-in method loads}
1588069  343.160  0.000  343.160  0.000  decoder.py:349(raw_decode)
14602755  331.047  0.000  837.768  0.000  types.py:600(<listcomp>)
689592988  285.243  0.000  506.721  0.000  types.py:438(fromInternal)
143753265  276.530  0.000  348.146  0.000  types.py:394(__init__)
143753265/43088951  261.310  0.000  1137.452  0.000  types.py:425(fromJson)
71409609  238.990  0.000  430.747  0.000  types.py:1354(_create_row)
867357953  181.193  0.000  181.193  0.000  {built-in method isinstance}
59406  176.652  0.003  176.652  0.003  {method 'read' of '_io.BufferedReader' objects}
146639466/1588069  168.921  0.000  1243.338  0.001  types.py:891(_parse_datatype_json_value)
149238722/71409609  136.504  0.000  1413.990  0.000  types.py:593(fromInternal)
71409609  84.433  0.000  84.433  0.000  {built-in method __new__ of type object at 0x9d1c40}
5453560/1588069  81.277  0.000  1160.041  0.001  types.py:562(<listcomp>)
615371111  61.250  0.000  61.250  0.000  types.py:87(fromInternal)
71409609  61.204  0.000  61.204  0.000  types.py:1483(__setattr__)
400  58.462  0.146  3824.286  9.561  serializers.py:259(dump_stream)
139887774  51.188  0.000  51.188  0.000  {method 'keys' of 'dict' objects}
71409609  51.071  0.000  1465.061  0.000  types.py:1351(<lambda>)
120861018  47.346  0.000  106.776  0.000  types.py:476(<genexpr>)
149206825  46.725  0.000  59.287  0.000  types.py:474(<genexpr>)
115407458  46.443  0.000  59.430  0.000  types.py:432(needConversion)
71409609  46.120  0.000  130.553  0.000  types.py:1400(__new__)
1588069  35.083  0.000  1643.525  0.001  types.py:842(_parse_datatype_json_string)
8915977  26.167  0.000  46.662  0.000  types.py:322(<listcomp>)
71409609  25.725  0.000  25.725  0.000  types.py:1350(_create_row_inbound_converter)
5453560  25.075  0.000  251.842  0.000  types.py:456(__init__)
2354475  21.215  0.000  21.215  0.000  {built-in method fromtimestamp}
5453560/1588069  20.432  0.000  1238.601  0.001  types.py:560(fromJson)
5453560  20.328  0.000  20.328  0.000  types.py:473(<listcomp>)
5453560  19.526  0.000  78.813  0.000  {built-in method all}
19161994  16.877  0.000  78.520  0.000  types.py:319(fromInternal)
5453560  15.870  0.000  121.936  0.000  {built-in method any}
19815964  13.901  0.000  27.726  0.000  types.py:174(fromInternal)
12500825  13.825  0.000  13.825  0.000  {built-in method fromordinal}
114485449  12.640  0.000  12.640  0.000  types.py:73(needConversion)
19250503  12.552  0.000  15.145  0.000  types.py:311(needConversion)
2538382  8.971  0.000  38.776  0.000  types.py:194(fromInternal)
2354475  8.590  0.000  8.590  0.000  {method 'replace' of 'datetime.datetime' objects}
4352424  8.459  0.000  8.459  0.000  {method 'match' of '_sre.SRE_Pattern' objects}
1588069  7.168  0.000  357.679  0.000  decoder.py:338(decode)
1588069  5.234  0.000  365.104  0.000  __init__.py:271(loads)
5453560  3.650  0.000  4.979  0.000  types.py:524(__iter__)
19293583  2.599  0.000  2.599  0.000  types.py:564(needConversion)
1298132  2.476  0.000  373.483  0.000  types.py:306(fromJson)
873033  2.318  0.000  6.430  0.000  <ipython-input-8-1ecd485d1a19>:2(<lambda>)
2461110  2.294  0.000  2.294  0.000  {method 'startswith' of 'str' objects}
873041  1.695  0.000  4.112  0.000  types.py:1470(__getattr__)
1298132  1.655  0.000  2.191  0.000  types.py:283(__init__)
873041  1.534  0.000  1.678  0.000  types.py:1457(__getitem__)
5453960  1.329  0.000  1.329  0.000  {built-in method iter}
588143  1.073  0.000  1.073  0.000  types.py:216(__init__)
3176138  0.943  0.000  0.943  0.000  {method 'end' of '_sre.SRE_Match' objects}
1176286  0.673  0.000  0.673  0.000  {method 'group' of '_sre.SRE_Match' objects}
1617580  0.420  0.000  0.420  0.000  {built-in method len}
29903  0.398  0.000  3759.130  0.126  serializers.py:155(_read_with_length)
873041  0.327  0.000  0.327  0.000  {method 'index' of 'list' objects}
29903  0.274  0.000  157.542  0.005  serializers.py:542(read_int)
29903  0.263  0.000  3759.393  0.126  serializers.py:136(load_stream)
29503  0.198  0.000  3581.688  0.121  serializers.py:418(loads)
29903  0.104  0.000  0.104  0.000  {built-in method unpack}
407054  0.097  0.000  0.097  0.000  types.py:185(needConversion)
383366  0.080  0.000  0.080  0.000  types.py:167(needConversion)
400  0.007  0.000  3824.302  9.561  worker.py:165(process)
400  0.006  0.000  0.008  0.000  serializers.py:217(load_stream)
400  0.002  0.000  0.002  0.000  rdd.py:303(func)
400  0.001  0.000  0.001  0.000  serializers.py:220(_load_stream_without_unbatching)
800  0.001  0.000  0.001  0.000  {built-in method from_iterable}
400  0.000  0.000  0.000  0.000  {method 'disable' of '_lsprof.Profiler' objects}
4  0.000  0.000  0.000  0.000  {built-in method dumps}
4  0.000  0.000  0.000  0.000  serializers.py:414(dumps)
4  0.000  0.000  0.000  0.000  serializers.py:549(write_int)
8  0.000  0.000  0.000  0.000  {method 'write' of '_io.BufferedWriter' objects}
4  0.000  0.000  0.000  0.000  {built-in method pack}
{code}

was (Author: maver1ck): (The earlier revision was identical to the comment above, with the same profiler tables, except that it ended with an additional question:) Or maybe I'm wrong, and this method only measures time spent reading data from Python, not Python code execution?

> Pyspark 2.0 performance drop vs pyspark 1.6
> -------------------------------------------
>
>                 Key: SPARK-16321
>                 URL: https://issues.apache.org/jira/browse/SPARK-16321
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Maciej Bryński
>         Attachments: visualvm_spark16.png, visualvm_spark2.png, visualvm_spark2_G1GC.png
>
> I did some tests on a parquet file with many nested columns (about 30G in 400 partitions), and Spark 2.0 is 2x slower.
> {code}
> df = sqlctx.read.parquet(path)
> df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id % 100000 else []).collect()
> {code}
> Spark 1.6 -> 2.3 min
> Spark 2.0 -> 4.6 min (2x slower)
> I used BasicProfiler for this task, and the cumulative time was:
> Spark 1.6 - 4300 sec
> Spark 2.0 - 5800 sec
> Should I expect such a drop in performance?
> I don't know how to prepare sample data to show the problem. Any ideas? Or public data with many nested columns?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
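As a closing sanity check on the numbers reported in this thread (plain Python arithmetic; the timings are copied from the description and the attached BasicProfiler tables, not re-measured):

```python
# Wall-clock run times from the issue description, in minutes.
wall_16, wall_20 = 2.3, 4.6
print(f"Wall clock: Spark 2.0 is {wall_20 / wall_16:.1f}x slower")

# Cumulative Python-profiler time from the attached tables, in seconds.
py_16, py_20 = 4360.151, 3824.303
print(f"Python side: Spark 2.0 is {(py_16 - py_20) / py_16:.0%} faster")

# The Python side got slightly faster while the job as a whole got slower,
# which is consistent with the regression being on the JVM side
# (PythonRunner.read() / Parquet reader), as the comment suggests.
```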