Re: Spark 3 is Slower than Spark 2 for TPCDS Q04 query.

Senthil Kumar Mon, 20 Dec 2021 09:39:00 -0800

Hi Luca,

I'm collecting the logical and physical plans, which should help to find
the root cause of this issue.
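
One possible way to capture the plans for a side-by-side diff is sketched below. This is only a sketch: `savePlan` and the output file name are hypothetical and not from this thread. It assumes a spark-shell session where `spark` is defined and `q4` holds the query text.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}

// Hypothetical helper (not from the thread): writes a plan dump to
// <dir>/q4_plan_spark_<version>.txt so the output of two Spark versions
// can be compared with an ordinary diff tool.
def savePlan(planText: String, sparkVersion: String, dir: String = "."): Path = {
  val target = Paths.get(dir, s"q4_plan_spark_$sparkVersion.txt")
  Files.write(target, planText.getBytes(StandardCharsets.UTF_8))
}

// Assumed usage in spark-shell, with `q4` holding the query text:
//   val df = spark.sql(q4)
//   df.explain(true)  // prints the logical plans and the physical plan
//   savePlan(df.queryExecution.toString, spark.version)
```

Running the same lines under Spark 2.4.5 and Spark 3.1 should give two text files whose differences show where the plans diverge.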


On Mon, 20 Dec 2021, 16:46 Luca Canali, <luca.can...@cern.ch> wrote:

> Hi Senthil,
>
>
>
> I have just run a couple of quick tests for TPCDS Q4, using the TPCDS
> schema created at scale 1500 that I have on a Hadoop/YARN cluster, and was
> not able to reproduce the difference in execution time between Spark 2 and
> Spark 3 that you report in your mail.
>
> This is the Spark config I used:
>
> bin/spark-shell --master yarn --driver-memory 8g --executor-cores 10
> --executor-memory 50g --conf spark.dynamicAllocation.enabled=false
> --num-executors 20
>
>
>
> This is how I ran the tests:
>
>
>
> ```
>
> val path="/project/spark/TPCDS/tpcds_1500_parquet_1.10.1/"
>
>
>
> val tables = List("catalog_returns","catalog_sales","inventory","store_returns","store_sales","web_returns","web_sales",
>   "call_center","catalog_page","customer","customer_address","customer_demographics","date_dim","household_demographics","income_band","item","promotion","reason","ship_mode","store","time_dim","warehouse","web_page","web_site")
>
>
>
> for (t <- tables) {
>
>   println(s"Creating temporary view $t")
>
>   spark.read.parquet(path + t).createOrReplaceTempView(t)
>
> }
>
>
>
> val q4="""…"""
>
> // SQL from https://github.com/databricks/spark-sql-perf/blob/master/src/main/resources/tpcds_2_4/q4.sql
>
>
>
> spark.time(sql(q4).collect) // note q4 result set is only 100 rows
>
> ```
>
>
>
> Spark 2.4.5:
>
> Time taken: 256812 ms
>
> Time taken: 226571 ms
>
> Time taken: 305508 ms
>
>
>
> Spark 3.1.2
>
> spark.time(sql(q4).collect)
>
> Time taken: 235356 ms
>
> Time taken: 236284 ms
>
>
>
> Best,
>
> Luca
>
>
>
> *From:* Senthil Kumar <sen...@gmail.com>
> *Sent:* Monday, December 20, 2021 10:20
> *To:* Rao, Abhishek (Nokia - IN/Bangalore) <abhishek....@nokia.com>
> *Cc:* dev <dev@spark.apache.org>
> *Subject:* Re: Spark 3 is Slower than Spark 2 for TPCDS Q04 query.
>
>
>
> Also, we checked that we have already backported the fix from
> https://issues.apache.org/jira/browse/SPARK-33557.
>
>
>
> On Mon, Dec 20, 2021 at 11:08 AM Senthil Kumar <sen...@gmail.com> wrote:
>
> @Abhishek, we use Spark 3.1*
>
>
>
> On Mon, 20 Dec 2021, 09:50 Rao, Abhishek (Nokia - IN/Bangalore), <
> abhishek....@nokia.com> wrote:
>
> Hi Senthil,
>
>
>
> Which version of Spark 3 are you using? We made the same kind of observation
> with Spark 3.0.2 and 3.1.x, and then figured out that we had configured a
> large value for spark.network.timeout, which did not take effect in any
> release prior to 3.0.2.
>
> This was fixed as part of
> https://issues.apache.org/jira/browse/SPARK-33557. Because we had
> configured a large value for spark.network.timeout, TPCDS queries took a
> long time on Spark 3.0.2 and 3.1.x. Once we corrected the setting, the
> queries ran much faster.
>
>
>
> Thanks and Regards,
>
> Abhishek
>
>
>
> *From:* Senthil Kumar <sen...@gmail.com>
> *Sent:* Sunday, December 19, 2021 11:58 PM
> *To:* dev <dev@spark.apache.org>
> *Subject:* Spark 3 is Slower than Spark 2 for TPCDS Q04 query.
>
>
>
> Hi All,
>
> We are comparing Spark 2.4.5 and Spark 3 (without enabling Spark 3's
> additional features) with TPCDS queries and found that Spark 3's
> performance is at least 30-40% lower than that of Spark 2.4.5.
>
>
>
> E.g.
>
> Data size used: 1 TB
>
>
> Spark 2.4.5 finishes Q4 in 1.5 min, but Spark 3.* takes at least 2.5
> min.
>
>
>
> Note: We tested this on the same cluster with the same data size, and
> we ensured that the parameters we passed are identical for Spark 2.4*
> and Spark 3*.
>
>
>
> It would be helpful to know whether any of you have encountered the same
> issue in your benchmarking. If so, please share your thoughts on what
> could be causing this poor performance.
>
>
>
> --
>
> Senthil kumar
>
>
>
>
> --
>
> Senthil kumar
>
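
The individual timings quoted in this thread vary noticeably between runs (roughly 227 s to 306 s across the three Spark 2.4.5 runs), so a comparison based on several runs and their median may be more robust than a single measurement. A minimal sketch follows; `timeRuns` and `median` are hypothetical helper names, not part of Spark or of the original test.

```scala
// Hypothetical helper: evaluates `body` `runs` times and returns the
// wall-clock duration of each evaluation in milliseconds.
def timeRuns[T](runs: Int)(body: => T): Seq[Long] =
  (1 to runs).map { _ =>
    val start = System.nanoTime()
    body // by-name parameter, so it is re-evaluated on every iteration
    (System.nanoTime() - start) / 1000000L
  }

// Median of the measurements; for an even count this returns the lower
// of the two middle values.
def median(xs: Seq[Long]): Long = {
  require(xs.nonEmpty, "need at least one measurement")
  xs.sorted.apply((xs.length - 1) / 2)
}

// Assumed usage in spark-shell, with `q4` holding the query text:
//   val times = timeRuns(5)(sql(q4).collect)
//   println(s"runs=$times median=${median(times)} ms")
```

The first run in a fresh session typically includes warm-up cost, so it may be worth discarding it before comparing versions.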
