[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...

eatoncys Tue, 01 Aug 2017 21:57:06 -0700

GitHub user eatoncys opened a pull request:

    https://github.com/apache/spark/pull/18810


    [SPARK-21603][sql] The wholestage codegen will be much slower than with
    wholestage codegen disabled when the function is too long

    ## What changes were proposed in this pull request?
    Disable whole-stage codegen when the number of lines in a generated
    function exceeds the maximum set by the
    spark.sql.codegen.MaxFunctionLength parameter, because a function that is
    too long does not get optimized by the JIT.
    In the benchmark below, the query is about 10x slower with whole-stage
    codegen enabled when the generated function is too long:
    
    ignore("max function length of wholestagecodegen") {
        val N = 20 << 15
    
        val benchmark = new Benchmark("max function length of wholestagecodegen", N)
        def f(): Unit = sparkSession.range(N)
          .selectExpr(
            "id",
            "(id & 1023) as k1",
            "cast(id & 1023 as double) as k2",
            "cast(id & 1023 as int) as k3",
            "case when id > 100 and id <= 200 then 1 else 0 end as v1",
            "case when id > 200 and id <= 300 then 1 else 0 end as v2",
            "case when id > 300 and id <= 400 then 1 else 0 end as v3",
            "case when id > 400 and id <= 500 then 1 else 0 end as v4",
            "case when id > 500 and id <= 600 then 1 else 0 end as v5",
            "case when id > 600 and id <= 700 then 1 else 0 end as v6",
            "case when id > 700 and id <= 800 then 1 else 0 end as v7",
            "case when id > 800 and id <= 900 then 1 else 0 end as v8",
            "case when id > 900 and id <= 1000 then 1 else 0 end as v9",
            "case when id > 1000 and id <= 1100 then 1 else 0 end as v10",
            "case when id > 1100 and id <= 1200 then 1 else 0 end as v11",
            "case when id > 1200 and id <= 1300 then 1 else 0 end as v12",
            "case when id > 1300 and id <= 1400 then 1 else 0 end as v13",
            "case when id > 1400 and id <= 1500 then 1 else 0 end as v14",
            "case when id > 1500 and id <= 1600 then 1 else 0 end as v15",
            "case when id > 1600 and id <= 1700 then 1 else 0 end as v16",
            "case when id > 1700 and id <= 1800 then 1 else 0 end as v17",
            "case when id > 1800 and id <= 1900 then 1 else 0 end as v18")
          .groupBy("k1", "k2", "k3")
          .sum()
          .collect()
    
        benchmark.addCase(s"codegen = F") { iter =>
          sparkSession.conf.set("spark.sql.codegen.wholeStage", "false")
          f()
        }
    
        benchmark.addCase(s"codegen = T") { iter =>
          sparkSession.conf.set("spark.sql.codegen.wholeStage", "true")
          sparkSession.conf.set("spark.sql.codegen.MaxFunctionLength", "10000")
          f()
        }
    
        benchmark.run()
    
        /*
        Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14 on Windows 7 6.1
        Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
        max function length of wholestagecodegen: Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        codegen = F                                    443 /  507          1.5         676.0       1.0X
        codegen = T                                   3279 / 3283          0.2        5002.6       0.1X
         */
      }
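
The length check described under "What changes were proposed" could be sketched
roughly as follows. This is not the code from this pull request: the object
name, both function names, and the brace-counting heuristic are assumptions
made purely for illustration. Only the idea of comparing a generated function's
line count against a configured maximum comes from the description above.

```scala
// Illustrative sketch only; NOT the implementation from this pull request.
// All names here are hypothetical.
object MaxFunctionLengthSketch {

  /** Line count of the longest method in a piece of generated Java source.
    *
    * Assumptions: braces are balanced, methods sit directly inside a class
    * body (nesting depth 1), and no braces appear in strings or comments.
    * A method written entirely on one line is not counted.
    */
  def longestFunctionLines(javaSource: String): Int = {
    var depth = 0    // current brace nesting depth
    var current = 0  // lines seen so far in the method being scanned
    var longest = 0
    for (line <- javaSource.split("\n")) {
      val wasInside = depth >= 2
      depth += line.count(_ == '{') - line.count(_ == '}')
      // Count the line if it starts, continues, or closes a method body.
      if (wasInside || depth >= 2) {
        current += 1
        longest = math.max(longest, current)
      }
      // Back at class level: the method has ended.
      if (depth < 2) current = 0
    }
    longest
  }

  /** True when, under this heuristic, whole-stage codegen would be skipped. */
  def exceedsLimit(javaSource: String, maxFunctionLength: Int): Boolean =
    longestFunctionLines(javaSource) > maxFunctionLength
}
```

With a check of this kind, a plan whose generated code exceeds the limit would
run without whole-stage codegen, as in the "codegen = F" case of the benchmark.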
    
    
    ## How was this patch tested?
    Run the unit test


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/eatoncys/spark codegen

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18810.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18810
    
----
commit ca9eff68424511fa11cc2bd695f1fddaae178e3c
Author: 10129659 <[email protected]>
Date:   2017-08-02T03:48:21Z

    The wholestage codegen will be slower when the function is too long

commit 1b0ac5ed896136df3579a61d7ef93980c0647e97
Author: 10129659 <[email protected]>
Date:   2017-08-02T04:41:24Z

    The wholestage codegen will be slower when the function is too long

----

