[
https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063406#comment-16063406
]
Saif Addin edited comment on SPARK-21198 at 6/26/17 5:02 PM:
-------------------------------------------------------------
Okay, I think there is something odd somewhere in between. It may be hard to
track down, but I'll go through it slowly, line by line.
(*spark-submit*, local[8])
1. This line gets stuck forever; the program does not continue even after
waiting 2 minutes (the Spark task is stuck in collect()):
{code:java}
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +:
  // note: `_` here is a Database object, so `_ != blacklistdb` (a String)
  // is always true; compare the name instead
  spark.catalog.listDatabases.filter(_.name != blacklistdb).map(_.name).collect
{code}
2. Changing the line to the following (note the filter moved after collect) takes 8ms:
{code:java}
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +:
(spark.catalog.listDatabases.map(_.name).collect.filterNot(_ == blacklistdb))
{code}
3. The following line takes 2ms instead:
{code:java}
private val databases: Array[String] =
  (REFERENCEDB +: spark.sql("show databases").collect.map(_.getString(0)))
    .filterNot(_ == blacklistdb)
{code}
Here's the weirdest part of all: if I instead start *spark-shell* and run item
number 1, it works (it takes 1ms, which is even faster than in my program; the
other lines also go down to 1ms).
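For reference, timings like the ones above can be reproduced with a small
helper like the following. This is a minimal sketch; the `Timed` object and
its formatting are my own and not part of Spark's API:

```scala
// Minimal timing helper (hypothetical; not part of Spark).
// Wrap any of the snippets above in timed("label") { ... } to measure it.
object Timed {
  def timed[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body // force evaluation of the wrapped expression
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(f"$label: $elapsedMs%.1f ms")
    result
  }
}
```

Usage would look like
`Timed.timed("listDatabases") { spark.catalog.listDatabases.map(_.name).collect }`,
which prints the elapsed time and returns the collected array unchanged.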
was (Author: revolucion09):
> SparkSession catalog is terribly slow
> -------------------------------------
>
> Key: SPARK-21198
> URL: https://issues.apache.org/jira/browse/SPARK-21198
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Saif Addin
>
> We have a considerably large Hive metastore and a Spark program that checks
> Hive data availability.
> In Spark 1.x, we used sqlContext.tableNames, sqlContext.sql(), and
> sqlContext.isCached() to go through the Hive metastore information.
> Once migrated to Spark 2.x, we switched over to SparkSession.catalog instead,
> but it turns out that both listDatabases() and listTables() take between 5 and
> 20 minutes, depending on the database, to return results, using operations
> such as the following one:
> spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect
> This made the program unbearably slow at returning a list of tables.
> I know we still have spark.sqlContext.tableNames as a workaround, but I am
> assuming this is going to be deprecated soon?
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]