[https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063431#comment-16063431]
Saif Addin edited comment on SPARK-21198 at 6/26/17 5:24 PM:
-------------------------------------------------------------
Regarding listTables, here is the code used inside the program:
{code:java}
import java.util.Date

println(s"processing all tables for every db. db length is: ${databases.tail.length}")
for (d <- databases.tail) {
  val d1 = new Date().getTime
  val dbs = spark.sqlContext.tables(d).filter("isTemporary = false")
    .select("tableName").collect.map(_.getString(0))
  println("Processed tables in DB using sqlContext. Time: " +
    ((new Date().getTime - d1) / 1000.0) + s" seconds. ${dbs.length} tables")
  val d2 = new Date().getTime
  val dbs2 = spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect
  println("Processed tables in DB using catalog. Time: " +
    ((new Date().getTime - d2) / 1000.0) + s" seconds. ${dbs2.length} tables")
  // ...other stuff
{code}
and the timings are as follows:
{code:java}
processing all tables for every db. db length is: 30
Processed tables in DB using sqlContext. Time: 0.863 seconds. 19 tables
Processed tables in DB using catalog. Time: 6.978 seconds. 19 tables
Processed tables in DB using sqlContext. Time: 0.276 seconds. 607 tables
Processed tables in DB using catalog. Time: 194.501 seconds. 607 tables
Processed tables in DB using sqlContext. Time: 0.243 seconds. 55 tables
Processed tables in DB using catalog. Time: 17.907 seconds. 55 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 13 tables
Processed tables in DB using catalog. Time: 4.642 seconds. 13 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 392 tables
Processed tables in DB using catalog. Time: 126.999 seconds. 392 tables
... goes on...
{code}
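As an aside, the repeated `new Date().getTime` arithmetic in both measurements could be factored into a small helper. This is a sketch of mine, not code from the original program:

```scala
// Hypothetical timing helper (not from the original program): runs a block
// and returns its result together with the elapsed wall-clock seconds.
// System.nanoTime is monotonic, unlike new Date().getTime.
def timed[T](block: => T): (T, Double) = {
  val start = System.nanoTime()
  val result = block
  (result, (System.nanoTime() - start) / 1e9)
}
```

With it, each measurement collapses to something like `val (tables, secs) = timed { spark.sqlContext.tables(d).collect }`.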
In a stand-alone spark-shell, as opposed to the full program run via spark-submit:
{code:java}
import java.util.Date

val dbs = spark.catalog.listDatabases.map(_.name).collect
for (d <- dbs) {
  val d1 = new Date().getTime
  val tables = spark.sqlContext.tables(d).filter("isTemporary = false")
    .select("tableName").collect.map(_.getString(0))
  println("Processed tables in DB using sqlContext. Time: " +
    ((new Date().getTime - d1) / 1000.0) + s" seconds. ${tables.length} tables")
  val d2 = new Date().getTime
  val tables2 = spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect
  println("Processed tables in DB using catalog. Time: " +
    ((new Date().getTime - d2) / 1000.0) + s" seconds. ${tables2.length} tables")
}
{code}
{code:java}
Processed tables in DB using sqlContext. Time: 0.59 seconds. 19 tables
Processed tables in DB using catalog. Time: 6.285 seconds. 19 tables
Processed tables in DB using sqlContext. Time: 0.276 seconds. 608 tables
Processed tables in DB using catalog. Time: 201.295 seconds. 608 tables
Processed tables in DB using sqlContext. Time: 0.241 seconds. 55 tables
... goes on. timings similar
{code}
So, apart from the weird listDatabases issue, listTables is consistently slow.
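A possible workaround, sketched under the assumption of Spark 2.1 APIs (`listTableNamesSql` is a hypothetical helper of mine): `catalog.listTables` materializes a full `Table` object (database, description, type, isTemporary) for every table, which appears to be where the time goes, while the variants below only fetch names.

```scala
// Hypothetical helper: build a names-only listing statement for a database.
// Assumption: SHOW TABLES IN <db> syntax as in Spark 2.1; the exact result
// column names of the returned DataFrame may differ between versions.
def listTableNamesSql(db: String): String = s"SHOW TABLES IN $db"

// In a live session (needs a metastore, so not runnable here):
// val names = spark.sql(listTableNamesSql(d)).select("tableName")
//   .collect.map(_.getString(0))
// or ask the external catalog directly, bypassing the analyzer:
// val names2 = spark.sharedState.externalCatalog.listTables(d)
```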
> SparkSession catalog is terribly slow
> -------------------------------------
>
> Key: SPARK-21198
> URL: https://issues.apache.org/jira/browse/SPARK-21198
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Saif Addin
>
> We have a considerably large Hive metastore and a Spark program that goes
> through Hive data availability.
> In Spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and
> sqlContext.isCached() to go through Hive metastore information.
> Once migrated to Spark 2.x we switched over to SparkSession.catalog instead,
> but it turns out that both listDatabases() and listTables() take between 5
> and 20 minutes, depending on the database, to return results, using
> operations such as the following one:
> spark.catalog.listTables(db).filter(!_.isTemporary).map(_.name).collect
> This made the program unbearably slow to return a list of tables.
> I know we still have spark.sqlContext.tableNames as a workaround, but I am
> assuming this is going to be deprecated anytime soon?
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)