[ https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Erik Krogen updated SPARK-35066:
--------------------------------
    Description: 
Hi,

The following code snippet runs 4-5 times slower on Apache Spark / PySpark 3.1.1 than on Apache Spark / PySpark 3.0.2:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, count, col
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

spark = SparkSession.builder \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000m") \
    .getOrCreate()

Toys = spark.read \
    .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText", outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

all_words = toys_with_tokens.select(explode("words").alias("word"))

# group by, sort and limit to 50k
top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(50000)

top50k.show()
{code}
Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 partitions are respected: all 12 tasks are processed in parallel. In Spark/PySpark 3.1.1, however, even though there are 12 tasks, 10 of them finish immediately and only 2 are actually processed. (I've tried disabling a couple of configs related to something similar, but none of them helped.)

Screenshot of a Spark 3.1.1 task:
!image-2022-03-17-17-18-36-793.png|width=1073,height=652!
!image-2022-03-17-17-19-11-655.png!

Screenshot of a Spark 3.0.2 task:
!image-2022-03-17-17-19-34-906.png!

For a longer discussion: [Spark User List|https://lists.apache.org/thread/1hlg9fpxnw8dzx8bd2fvffmk7yozoszf]

You can reproduce this large performance difference between Spark 3.1.1 and Spark 3.0.2 by running the shared code against any dataset that is large enough to take longer than a minute. I'm not sure whether this is related to SQL, to some Spark config that was introduced in 3.x but only started taking effect in 3.1.1, or to .transform in Spark ML.
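One mechanism in Spark 3.x that can leave most post-shuffle tasks with no data is adaptive query execution's shuffle-partition coalescing. The following is only a sketch of how one might rule that out; these are standard Spark SQL configs, and it is not known whether they are the same ones the reporter already tried:
{code:python}
from pyspark.sql import SparkSession

# Hypothetical diagnostic sketch: disable AQE and its shuffle-partition
# coalescing, and pin the shuffle partition count to match repartition(12).
spark = SparkSession.builder \
    .master("local[*]") \
    .config("spark.sql.adaptive.enabled", "false") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "false") \
    .config("spark.sql.shuffle.partitions", "12") \
    .getOrCreate()

# After re-running the job, the per-stage task distribution can be compared
# in the Spark UI (http://localhost:4040) against the 3.0.2 run.
{code}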
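To put a number on the 4-5x claim, the same pipeline can be wrapped in a simple wall-clock timer and run once per Spark version. This is an illustrative harness only; the './toys-cleaned' path and column names are taken from the snippet above:
{code:python}
import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, count, col
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

spark = SparkSession.builder.master("local[*]").getOrCreate()

start = time.time()

toys = spark.read.parquet('./toys-cleaned').repartition(12)
words = RegexTokenizer(inputCol="reviewText", outputCol="all_words", pattern="\\W").transform(toys)
tokens = StopWordsRemover(inputCol="all_words", outputCol="words").transform(words).drop("all_words")
top50k = tokens.select(explode("words").alias("word")) \
    .groupBy("word").agg(count("*").alias("total")) \
    .sort(col("total").desc()).limit(50000)
top50k.show()  # forces execution of the whole pipeline

print(f"Spark {spark.version}: {time.time() - start:.1f}s")
{code}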
> Spark 3.1.1 is slower than 3.0.2 by 4-5 times
> ---------------------------------------------
>
>                 Key: SPARK-35066
>                 URL: https://issues.apache.org/jira/browse/SPARK-35066
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, SQL
>    Affects Versions: 3.1.1
>         Environment: Spark/PySpark: 3.1.1
>                      Language: Python 3.7.x / Scala 2.12
>                      OS: macOS, Linux, and Windows
>                      Cloud: Databricks 7.3 for 3.0.1 and 8 for 3.1.1
>            Reporter: Maziyar PANAHI
>            Priority: Major
>         Attachments: Screenshot 2021-04-07 at 11.15.48.png, Screenshot 2021-04-08 at 15.08.09.png, Screenshot 2021-04-08 at 15.13.19-1.png, Screenshot 2021-04-08 at 15.13.19.png, image-2022-03-17-17-18-36-793.png, image-2022-03-17-17-19-11-655.png, image-2022-03-17-17-19-34-906.png