[
https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Maziyar PANAHI updated SPARK-35066:
-----------------------------------
Attachment: Screenshot 2021-04-08 at 15.13.19.png
Screenshot 2021-04-08 at 15.08.09.png
Screenshot 2021-04-07 at 11.15.48.png
> Spark 3.1.1 is slower than 3.0.2 by 4-5 times
> ---------------------------------------------
>
> Key: SPARK-35066
> URL: https://issues.apache.org/jira/browse/SPARK-35066
> Project: Spark
> Issue Type: Bug
> Components: ML, SQL
> Affects Versions: 3.1.1
> Environment: Spark/PySpark: 3.1.1
> Language: Python 3.7.x / Scala 2.12
> OS: macOS, Linux, and Windows
> Cloud: Databricks Runtime 7.3 (Spark 3.0.1) and 8 (Spark 3.1.1)
> Reporter: Maziyar PANAHI
> Priority: Major
> Attachments: Screenshot 2021-04-07 at 11.15.48.png, Screenshot
> 2021-04-08 at 15.08.09.png, Screenshot 2021-04-08 at 15.13.19.png
>
>
> Hi,
> The following code snippet runs 4-5 times slower on Apache Spark or PySpark
> 3.1.1 compared to Apache Spark or PySpark 3.0.2:
>
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import explode, count, col
> from pyspark.ml.feature import RegexTokenizer, StopWordsRemover
>
> spark = SparkSession.builder \
>     .master("local[*]") \
>     .config("spark.driver.memory", "16G") \
>     .config("spark.driver.maxResultSize", "0") \
>     .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
>     .config("spark.kryoserializer.buffer.max", "2000m") \
>     .getOrCreate()
>
> Toys = spark.read.parquet('./toys-cleaned').repartition(12)
>
> # tokenize the text into words
> regexTokenizer = RegexTokenizer(inputCol="reviewText", outputCol="all_words",
>                                 pattern="\\W")
> toys_with_words = regexTokenizer.transform(Toys)
>
> # remove stop words
> remover = StopWordsRemover(inputCol="all_words", outputCol="words")
> toys_with_tokens = remover.transform(toys_with_words).drop("all_words")
>
> all_words = toys_with_tokens.select(explode("words").alias("word"))
>
> # group by word, sort by count descending, and keep the top 50k
> top50k = all_words.groupBy("word").agg(count("*").alias("total")) \
>     .sort(col("total").desc()).limit(50000)
>
> top50k.show()
> {code}
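>
> To quantify the 4-5x gap, a minimal timing sketch (not part of the original
> run, just wall-clocking the final action under each Spark version):
>
> {code:python}
> # Sketch: measure the wall-clock time of the action that triggers the job.
> # Run the same script against 3.0.2 and 3.1.1 and compare the elapsed times.
> import time
>
> start = time.time()
> top50k.show()
> print(f"elapsed: {time.time() - start:.1f}s")
> {code}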
>
> Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12
> partitions are respected: all 12 tasks run in parallel. In Spark/PySpark
> 3.1.1, however, even though there are still 12 tasks, 10 of them finish
> immediately and only 2 do any real work. (I have tried disabling a couple of
> configs that seemed related, but none of them helped; a sketch of what I
> mean follows below.)
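> A minimal sketch of those checks (the config names are real Spark settings,
> but whether they are the culprit in this regression is only an assumption):
>
> {code:python}
> # Confirm the DataFrame really has 12 partitions before the shuffle.
> print(toys_with_tokens.rdd.getNumPartitions())  # expected: 12
>
> # Candidate settings to toggle while bisecting 3.0.2 vs. 3.1.1 behavior;
> # these control adaptive query execution and shuffle-partition coalescing.
> spark.conf.set("spark.sql.adaptive.enabled", "false")
> spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "false")
> {code}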
> Screenshot of the Spark 3.1.1 tasks: [Spark UI
> 3.1.1|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009725-af969e00-9863-11eb-8e5b-07ce53e8f5f3.png]
> Screenshot of the Spark 3.0.2 tasks: [Spark UI
> 3.0.2|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009712-ac9bad80-9863-11eb-9e55-c797833bdbba.png]
> For a longer discussion: [Spark User
> List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html]
>
> You can reproduce this large performance difference between Spark 3.1.1 and
> Spark 3.0.2 by running the shared code with any dataset large enough for the
> job to take longer than a minute. I am not sure whether this is related to
> SQL, to some Spark config that exists in 3.x but only takes effect as of
> 3.1.1, or to .transform in Spark ML.
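> Since the ./toys-cleaned parquet is not attached here, a hedged sketch for
> generating a synthetic substitute (the column name reviewText matches the
> code above; the word list and row counts are made-up illustrations):
>
> {code:python}
> # Sketch: synthesize a reviewText dataset large enough that the job runs
> # for more than a minute. All values below are illustrative, not the
> # original data.
> import random
>
> words = ["toy", "fun", "great", "broken", "kids", "plastic", "gift", "cheap"]
>
> def make_row(_):
>     # 50 random words per review
>     return (" ".join(random.choices(words, k=50)),)
>
> reviews = spark.sparkContext.parallelize(range(2_000_000), 12).map(make_row)
> reviews.toDF(["reviewText"]).write.mode("overwrite").parquet("./toys-cleaned")
> {code}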
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]