[https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063431#comment-16063431]
Saif Addin edited comment on SPARK-21198 at 6/26/17 5:24 PM:
-------------------------------------------------------------
Regarding listTables, here is the code used inside the program:
{code:java}
import java.util.Date

println(s"processing all tables for every db. db length is: ${databases.tail.length}")
for (d <- databases.tail) {
  val d1 = new Date().getTime
  val dbs = spark.sqlContext.tables(d).filter("isTemporary = false")
    .select("tableName").collect.map(_.getString(0))
  println("Processed tables in DB using sqlContext. Time: " +
    ((new Date().getTime - d1) / 1000.0) + s" seconds. ${dbs.length} tables")
  val d2 = new Date().getTime
  val dbs2 = spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect
  println("Processed tables in DB using catalog. Time: " +
    ((new Date().getTime - d2) / 1000.0) + s" seconds. ${dbs2.length} tables")
  // ...other stuff
{code}
and the timings are as follows:
{code:java}
processing all tables for every db. db length is: 30
Processed tables in DB using sqlContext. Time: 0.863 seconds. 19 tables
Processed tables in DB using catalog. Time: 6.978 seconds. 19 tables
Processed tables in DB using sqlContext. Time: 0.276 seconds. 607 tables
Processed tables in DB using catalog. Time: 194.501 seconds. 607 tables
Processed tables in DB using sqlContext. Time: 0.243 seconds. 55 tables
Processed tables in DB using catalog. Time: 17.907 seconds. 55 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 13 tables
Processed tables in DB using catalog. Time: 4.642 seconds. 13 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 392 tables
Processed tables in DB using catalog. Time: 126.999 seconds. 392 tables
... goes on...
{code}
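As an aside, the repeated `new Date().getTime` arithmetic in both measurements could be factored into a small helper. This is a sketch of mine, not code from the original program:

```scala
// Hypothetical timing helper (not from the original program): runs a block
// and returns its result together with the elapsed wall-clock seconds.
// System.nanoTime is monotonic, unlike new Date().getTime.
def timed[T](block: => T): (T, Double) = {
  val start = System.nanoTime()
  val result = block
  (result, (System.nanoTime() - start) / 1e9)
}
```

With it, each measurement collapses to something like `val (tables, secs) = timed { spark.sqlContext.tables(d).collect }`.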
In a stand-alone spark-shell, as opposed to the full program run via spark-submit:
{code:java}
import java.util.Date

val dbs = spark.catalog.listDatabases.map(_.name).collect
for (d <- dbs) {
  val d1 = new Date().getTime
  val tables = spark.sqlContext.tables(d).filter("isTemporary = false")
    .select("tableName").collect.map(_.getString(0))
  println("Processed tables in DB using sqlContext. Time: " +
    ((new Date().getTime - d1) / 1000.0) + s" seconds. ${tables.length} tables")
  val d2 = new Date().getTime
  val tables2 = spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect
  println("Processed tables in DB using catalog. Time: " +
    ((new Date().getTime - d2) / 1000.0) + s" seconds. ${tables2.length} tables")
}
{code}
{code:java}
Processed tables in DB using sqlContext. Time: 0.59 seconds. 19 tables
Processed tables in DB using catalog. Time: 6.285 seconds. 19 tables
Processed tables in DB using sqlContext. Time: 0.276 seconds. 608 tables
Processed tables in DB using catalog. Time: 201.295 seconds. 608 tables
Processed tables in DB using sqlContext. Time: 0.241 seconds. 55 tables
... goes on. timings similar
{code}
So, apart from the weird listDatabases issue, listTables is consistently slow.
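A possible workaround, sketched under the assumption of Spark 2.1 APIs (`listTableNamesSql` is a hypothetical helper of mine): `catalog.listTables` materializes a full `Table` object (database, description, type, isTemporary) for every table, which appears to be where the time goes, while the variants below only fetch names.

```scala
// Hypothetical helper: build a names-only listing statement for a database.
// Assumption: SHOW TABLES IN <db> syntax as in Spark 2.1; the exact result
// column names of the returned DataFrame may differ between versions.
def listTableNamesSql(db: String): String = s"SHOW TABLES IN $db"

// In a live session (needs a metastore, so not runnable here):
// val names = spark.sql(listTableNamesSql(d)).select("tableName")
//   .collect.map(_.getString(0))
// or ask the external catalog directly, bypassing the analyzer:
// val names2 = spark.sharedState.externalCatalog.listTables(d)
```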
> SparkSession catalog is terribly slow
> -------------------------------------
>
> Key: SPARK-21198
> URL: https://issues.apache.org/jira/browse/SPARK-21198
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Saif Addin
>
> We have a considerably large Hive metastore and a Spark program that goes
> through Hive data availability.
> In Spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and
> sqlContext.isCached() to go through Hive metastore information.
> Once migrated to Spark 2.x we switched over to SparkSession.catalog instead,
> but it turns out that both listDatabases() and listTables() take between 5
> and 20 minutes, depending on the database, to return results, using
> operations such as the following one:
> spark.catalog.listTables(db).filter(!_.isTemporary).map(_.name).collect
> This made the program unbearably slow to return a list of tables.
> I know we still have spark.sqlContext.tableNames as a workaround, but I am
> assuming this is going to be deprecated anytime soon?
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)