[
https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Maziyar PANAHI updated SPARK-35066:
-----------------------------------
Attachment: Screenshot 2021-04-08 at 15.13.19.png
Screenshot 2021-04-08 at 15.08.09.png
Screenshot 2021-04-07 at 11.15.48.png
> Spark 3.1.1 is slower than 3.0.2 by 4-5 times
> ---------------------------------------------
>
> Key: SPARK-35066
> URL: https://issues.apache.org/jira/browse/SPARK-35066
> Project: Spark
> Issue Type: Bug
> Components: ML, SQL
> Affects Versions: 3.1.1
> Environment: Spark/PySpark: 3.1.1
> Language: Python 3.7.x / Scala 2.12
> OS: macOS, Linux, and Windows
> Cloud: Databricks Runtime 7.3 (Spark 3.0.1) and 8 (Spark 3.1.1)
> Reporter: Maziyar PANAHI
> Priority: Major
> Attachments: Screenshot 2021-04-07 at 11.15.48.png, Screenshot
> 2021-04-08 at 15.08.09.png, Screenshot 2021-04-08 at 15.13.19.png
>
>
> Hi,
> The following code snippet runs 4-5 times slower on Apache Spark or PySpark
> 3.1.1 compared to Apache Spark or PySpark 3.0.2:
>
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import explode, count, col
> from pyspark.ml.feature import RegexTokenizer, StopWordsRemover
>
> spark = SparkSession.builder \
>     .master("local[*]") \
>     .config("spark.driver.memory", "16G") \
>     .config("spark.driver.maxResultSize", "0") \
>     .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
>     .config("spark.kryoserializer.buffer.max", "2000m") \
>     .getOrCreate()
>
> Toys = spark.read.parquet('./toys-cleaned').repartition(12)
>
> # tokenize the text into words
> regexTokenizer = RegexTokenizer(inputCol="reviewText", outputCol="all_words",
>                                 pattern="\\W")
> toys_with_words = regexTokenizer.transform(Toys)
>
> # remove stop words
> remover = StopWordsRemover(inputCol="all_words", outputCol="words")
> toys_with_tokens = remover.transform(toys_with_words).drop("all_words")
>
> all_words = toys_with_tokens.select(explode("words").alias("word"))
>
> # group by word, sort by count descending, and keep the top 50k
> top50k = all_words.groupBy("word").agg(count("*").alias("total")) \
>     .sort(col("total").desc()).limit(50000)
>
> top50k.show()
> {code}
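>
> To quantify the 4-5x gap, a minimal timing sketch (not part of the original
> run, just wall-clocking the final action under each Spark version):
>
> {code:python}
> # Sketch: measure the wall-clock time of the action that triggers the job.
> # Run the same script against 3.0.2 and 3.1.1 and compare the elapsed times.
> import time
>
> start = time.time()
> top50k.show()
> print(f"elapsed: {time.time() - start:.1f}s")
> {code}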
>
> Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12
> partitions are respected: all 12 tasks run in parallel. In Spark/PySpark
> 3.1.1, however, even though there are still 12 tasks, 10 of them finish
> immediately and only 2 do any real work. (I have tried disabling a couple of
> configs that seemed related, but none of them helped; a sketch of what I
> mean follows below.)
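> A minimal sketch of those checks (the config names are real Spark settings,
> but whether they are the culprit in this regression is only an assumption):
>
> {code:python}
> # Confirm the DataFrame really has 12 partitions before the shuffle.
> print(toys_with_tokens.rdd.getNumPartitions())  # expected: 12
>
> # Candidate settings to toggle while bisecting 3.0.2 vs. 3.1.1 behavior;
> # these control adaptive query execution and shuffle-partition coalescing.
> spark.conf.set("spark.sql.adaptive.enabled", "false")
> spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "false")
> {code}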
> Screenshot of the Spark 3.1.1 tasks: [Spark UI
> 3.1.1|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009725-af969e00-9863-11eb-8e5b-07ce53e8f5f3.png]
> Screenshot of the Spark 3.0.2 tasks: [Spark UI
> 3.0.2|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009712-ac9bad80-9863-11eb-9e55-c797833bdbba.png]
> For a longer discussion: [Spark User
> List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html]
>
> You can reproduce this large performance difference between Spark 3.1.1 and
> Spark 3.0.2 by running the shared code with any dataset large enough for the
> job to take longer than a minute. I am not sure whether this is related to
> SQL, to some Spark config that exists in 3.x but only takes effect as of
> 3.1.1, or to .transform in Spark ML.
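> Since the ./toys-cleaned parquet is not attached here, a hedged sketch for
> generating a synthetic substitute (the column name reviewText matches the
> code above; the word list and row counts are made-up illustrations):
>
> {code:python}
> # Sketch: synthesize a reviewText dataset large enough that the job runs
> # for more than a minute. All values below are illustrative, not the
> # original data.
> import random
>
> words = ["toy", "fun", "great", "broken", "kids", "plastic", "gift", "cheap"]
>
> def make_row(_):
>     # 50 random words per review
>     return (" ".join(random.choices(words, k=50)),)
>
> reviews = spark.sparkContext.parallelize(range(2_000_000), 12).map(make_row)
> reviews.toDF(["reviewText"]).write.mode("overwrite").parquet("./toys-cleaned")
> {code}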
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]