Re: SparkSQL Performance Tuning Options

Cheng Lian Tue, 27 Jan 2015 23:45:06 -0800


On 1/27/15 5:55 PM, Cheng Lian wrote:

On 1/27/15 11:38 AM, Manoj Samel wrote:
Spark 1.2, no Hive, prefer not to use HiveContext to avoid metastore_db.
Use case: a Spark YARN app will start and serve as a query server for multiple users, i.e. it is always up and running. At startup, there is an option to cache data and also precompute some result sets, hash maps, etc. that are likely to be asked for by client APIs. That is, startup time can be used to precompute/cache, but the query response time requirement on large data sets is very stringent.
Hoping to use SparkSQL (but a combination of SQL and RDD APIs is also OK).
* Does SparkSQL execution use underlying partition information? (Data is from HDFS)
No. For example, if the underlying data has already been partitioned by some key, Spark SQL doesn't know it, and can't leverage that information to avoid a shuffle when aggregating on that key. However, partitioning the data ahead of time does help minimize shuffle network IO. There's a JIRA ticket to make Spark SQL aware of the underlying data distribution.

Maybe you are asking about locality? If that's the case, I just want to add that Spark SQL does understand the locality information of the underlying data. It's obtained from the Hadoop InputFormat.
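
Since a mix of SQL and RDD APIs is acceptable here, a minimal sketch of the RDD-side alternative might look like the following. Unlike Spark SQL in 1.2, the pair-RDD API does keep track of a partitioner, so an aggregation on an already hash-partitioned and cached RDD can reuse it. The input path, the two-column CSV layout, and the partition count of 64 are assumptions made purely for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // brings in pair-RDD functions on Spark 1.2

object PrePartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("prepartition-sketch"))

    // Hypothetical input: lines of the form "key,amount".
    val pairs = sc.textFile("hdfs:///path/to/events.csv")
      .map(_.split(","))
      .map(fields => (fields(0), fields(1).toDouble))

    // Pay the shuffle once at startup, then keep the partitioned data in memory.
    val partitioned = pairs.partitionBy(new HashPartitioner(64)).cache()
    partitioned.count() // force materialization before serving queries

    // reduceByKey picks up the existing partitioner of its parent RDD,
    // so this aggregation runs within the existing partitions.
    val totals = partitioned.reduceByKey(_ + _)
    totals.take(10).foreach(println)

    sc.stop()
  }
}
```

This only helps queries expressed against the RDD API; queries issued through Spark SQL would still plan their own shuffle, as described above.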

* Are there any ways to give "hints" to the SparkSQL execution about any precomputed/pre-cached RDDs?
Instead of caching the raw RDD, it's recommended to transform the raw RDD into a SchemaRDD and then cache it, so that in-memory columnar storage can be used. Spark SQL also recognizes cached SchemaRDDs automatically.
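
A minimal sketch of that recommendation against the Spark 1.2 API, using a plain SQLContext (no HiveContext). The Event case class, the table name "events", and the input path are hypothetical and only there to make the example self-contained.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type, for illustration only.
case class Event(userId: String, amount: Double)

object CacheSchemaRDDSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-sketch"))
    val sqlContext = new SQLContext(sc)
    // Implicit conversion from an RDD of case classes to a SchemaRDD (Spark 1.2).
    import sqlContext.createSchemaRDD

    // Hypothetical input: lines of the form "userId,amount".
    val events = sc.textFile("hdfs:///path/to/events.csv")
      .map(_.split(","))
      .map(fields => Event(fields(0), fields(1).toDouble))

    // Register the SchemaRDD as a table and cache it in columnar form.
    events.registerTempTable("events")
    sqlContext.cacheTable("events")

    // Run one scan at startup so the cache is populated before users query it.
    sqlContext.sql("SELECT COUNT(*) FROM events").collect()

    // Later queries that reference "events" read from the in-memory columnar cache.
    val totals = sqlContext.sql(
      "SELECT userId, SUM(amount) FROM events GROUP BY userId")
    totals.collect().foreach(println)

    sc.stop()
  }
}
```

No explicit hint is needed in the query itself: once the table is cached, queries that reference it by name pick up the cached data.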
* Packages spark.sql.execution, spark.sql.execution.joins and other sql.xxx packages - is using these for tuning the query plan recommended? Would like to keep this as-needed if possible
Not sure whether I understood this question. Are you trying to use internal APIs to do customized optimizations?
* Features not in the current release but scheduled for an upcoming release would also be good to know.
Thanks,
PS: This is not a small topic, so if someone prefers to start an offline thread on details, I can do that and summarize the conclusions back to this thread.


