Thanks Jörn for the input. Our users want to run queries that perform large 
aggregations of data from different tables, as well as simple ad hoc queries 
over a single table. The tables are all in ORC format. They're currently using 
the Hive + Tez architecture that you mention but are experiencing performance 
issues, so one of the things we're considering is moving them to Spark SQL 
where it makes sense, which is why I wanted to know people's experience with 
the various tools.


On Feb 11, 2017, at 12:22 AM, Jörn Franke <jornfra...@gmail.com> wrote:

I think this is a rather simplistic view. All the tools do computation 
in-memory in the end. For certain types of computation and usage patterns it 
makes sense to keep the data in memory. For example, most machine learning 
approaches require the same data in several iterative calculations. This is 
what Spark has been designed for. Most aggregations/precalculations are just 
done by using the data in memory once. Here is where Hive + Tez, and to a 
limited extent Spark, can help. The third pattern, where users interactively 
query the data (i.e., many concurrent users query the same or similar data 
very frequently), is addressed by Hive on Tez + LLAP, Hive + Tez + Ignite, or 
Spark + Ignite (and there are other tools).

So it is important to understand what your users want to do.

Then, you have a lot of benchmark data on the web to compare. However, I 
always recommend generating or using data yourself that matches what the 
company is actually using. Keep in mind also that time is needed to convert 
this data into an efficient format.

On 10 Feb 2017, at 20:36, Saikat Kanjilal <sxk1...@hotmail.com> wrote:


Folks,

I'm embarking on a project to build a POC around Spark SQL. I was wondering if 
anyone has experience comparing Spark SQL with Hive (or interactive Hive), and 
has data points around the types of queries suited for each. I am naively 
assuming that Spark SQL will beat Hive in all queries, given that computations 
are mostly done in memory, but I want to hear some more data points around 
queries that may be problematic in Spark SQL. Also, are there debugging tools 
people ordinarily use with Spark SQL to troubleshoot performance-related 
issues?


I look forward to hearing from the community.

Regards
