[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349551#comment-16349551 ]
Thomas Graves commented on SPARK-23309:
---------------------------------------

I'm seeing the same time difference after adding in the spark.table("dailyCached").count().

[~dongjoon] Correct, this happens only when reading from cached data. Without caching, Spark 2.3 is quite a bit faster (1.5-2x+) than Spark 2.2 when reading from Hive using ORC (which is awesome, thanks for all the work!).

I'm now running with --conf spark.sql.orc.impl=hive --conf spark.sql.hive.convertMetastoreOrc=false. For the smaller data set the times did get closer: only a 1 second difference on average between Spark 2.2 and Spark 2.3. I'm trying to run on the larger dataset now. I'm wondering if much of the difference is due to the larger number of partitions you get with hive native in Spark 2.3.

> Spark 2.3 cached query performance 20-30% worse then spark 2.2
> --------------------------------------------------------------
>
> Key: SPARK-23309
> URL: https://issues.apache.org/jira/browse/SPARK-23309
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Thomas Graves
> Priority: Blocker
>
> I was testing Spark 2.3 rc2 and I am seeing a performance regression in SQL
> queries on cached data.
> The size of the data: 10.4GB input from Hive ORC files / 18.8 GB cached / 5592
> partitions
> Here is the example query:
> val dailycached = spark.sql("select something from table where dt = '20170301' AND something IS NOT NULL")
> dailycached.createOrReplaceTempView("dailycached")
> spark.catalog.cacheTable("dailyCached")
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
>
> On Spark 2.2 I see query times average 13 seconds.
> On the same nodes I see Spark 2.3 query times average 17 seconds.
> Note these are times of queries after the initial caching, so just running
> the last line again:
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
> multiple times.
>
> I also ran a query over more data (335GB input/587.5 GB cached) and saw a
> similar discrepancy in the performance of querying cached data between Spark
> 2.3 and Spark 2.2, where 2.2 was better by about 20%.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
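
For anyone who wants to repeat the comparison, here is a minimal sketch of how the repeated runs could be timed and how the reported averages translate into a relative slowdown. It is only an illustration: the object and helper names (CachedQueryTiming, timeRuns, mean, slowdownPercent) are hypothetical and not part of Spark, and the choice of 5 runs in the usage comment is an assumption. The only numbers taken from the issue are the 13 second and 17 second averages.

```scala
// Hypothetical helpers for timing repeated runs of a query; not part of Spark.
object CachedQueryTiming {

  /** Runs `body` `runs` times and returns the wall-clock seconds of each run. */
  def timeRuns(runs: Int)(body: => Unit): Seq[Double] =
    (1 to runs).map { _ =>
      val start = System.nanoTime()
      body
      (System.nanoTime() - start) / 1e9
    }

  /** Arithmetic mean of the measured times. */
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  /** Percentage by which `candidate` takes longer than `baseline`. */
  def slowdownPercent(baseline: Double, candidate: Double): Double =
    (candidate - baseline) / baseline * 100.0

  def main(args: Array[String]): Unit = {
    // Averages reported in the issue: 13 s on Spark 2.2, 17 s on Spark 2.3.
    val pct = slowdownPercent(13.0, 17.0)
    println(f"Spark 2.3 vs 2.2 on cached data: $pct%.1f%% slower")
  }
}

// In spark-shell, after the table has been cached as in the issue
// (the run count of 5 is an assumption):
//   val secs = CachedQueryTiming.timeRuns(5) {
//     spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
//   }
//   println(CachedQueryTiming.mean(secs))
```

With the averages from the issue, (17 - 13) / 13 works out to roughly a 31% longer query time on 2.3, which sits at the upper end of the 20-30% range in the issue title.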