Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-31 Thread Tin Vu
>> - Data is stored by Hive in ORC format. >> I executed a very simple SQL query: "SELECT * from table_name". >> The issue is that for some small tables (even tables with a few dozen records), SparkSQL still required about 7-8 seconds to finish, while Drill and Presto only needed less than 1 second.

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-31 Thread Gourav Sengupta
> I executed a very simple SQL query: "SELECT * from table_name". > The issue is that for some small tables (even tables with a few dozen records), SparkSQL still required about 7-8 seconds to finish, while Drill and Presto only needed less than 1 second. > For other large tables with billions of records, SparkSQL performance was reasonable, requiring 20-30 seconds to scan the whole table.

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-29 Thread Lalwani, Jayesh
ser@spark.apache.org" <user@spark.apache.org> Subject: Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto You are right. There are too much tasks was created. How can we reduce the number of tasks? On Thu, Mar 29, 2018, 7:44 AM Lalwani,

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-29 Thread Tin Vu
> Date: Wednesday, March 28, 2018 at 8:04 PM > To: "user@spark.apache.org" > Subject: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto > Hi, I am executing a benchmark

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-29 Thread Lalwani, Jayesh
UI. From: Tin Vu <tvu...@ucr.edu> Date: Wednesday, March 28, 2018 at 8:04 PM To: "user@spark.apache.org" <user@spark.apache.org> Subject: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto Hi, I am executing a benchmark

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Tin Vu
> Data is stored by Hive in ORC format. > I executed a very simple SQL query: "SELECT * from table_name". > The issue is that for some small tables (even tables with a few dozen records), SparkSQL still required about 7-8 seconds to finish, while Drill and Presto only needed less than 1 second. > For other large tables with billions of records, SparkSQL performance was reasonable.

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Jörn Franke
> SparkSQL still required about 7-8 seconds to finish, while Drill and Presto only needed less than 1 second. > For other large tables with billions of records, SparkSQL performance was reasonable, requiring 20-30 seconds to scan the whole table. > Do you have any idea or reasonable explanation for this issue? > Thanks,

[SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Tin Vu
while Drill and Presto only needed less than 1 second. For other large tables with billions of records, SparkSQL performance was reasonable, requiring 20-30 seconds to scan the whole table. Do you have any idea or reasonable explanation for this issue? Thanks,

Re: Spark Streaming - Multiple Spark Contexts (SparkSQL) Performance

2017-10-01 Thread Gerard Maas
Hammad, the recommended way to implement this logic would be to: create a SparkSession; create a StreamingContext using the SparkContext embedded in the SparkSession; use the single SparkSession instance for the SQL operations within the foreachRDD. It's important to note that Spark operations
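A minimal Java sketch of the pattern Gerard describes, assuming a socket text source and a hypothetical Event bean (neither is from the thread); the key points are a single SparkSession, a StreamingContext built from the SparkContext inside it, and the same session reused in foreachRDD, which runs on the driver, so capturing the session is safe.

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class SingleSessionStreaming {
      // Hypothetical bean, only so the RDD can be turned into a DataFrame below.
      public static class Event implements java.io.Serializable {
        private String value;
        public Event() {}
        public Event(String value) { this.value = value; }
        public String getValue() { return value; }
        public void setValue(String value) { this.value = value; }
      }

      public static void main(String[] args) throws InterruptedException {
        // 1. One SparkSession for the whole application -- no allowMultipleContexts.
        SparkSession spark = SparkSession.builder()
            .master("local[2]")
            .appName("TransformerStreamPOC")
            .getOrCreate();

        // 2. Build the streaming context from the SparkContext already inside the session.
        JavaStreamingContext jssc = new JavaStreamingContext(
            new JavaSparkContext(spark.sparkContext()), Durations.seconds(60));

        JavaDStream<Event> events =
            jssc.socketTextStream("localhost", 9999).map(Event::new);

        // 3. Reuse the same SparkSession for SQL inside foreachRDD
        //    (foreachRDD executes on the driver, so this is safe).
        events.foreachRDD(rdd -> {
          Dataset<Row> df = spark.createDataFrame(rdd, Event.class);
          df.createOrReplaceTempView("events");
          spark.sql("SELECT count(*) AS n FROM events").show();
        });

        jssc.start();
        jssc.awaitTermination();
      }
    }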

Fwd: Spark Streaming - Multiple Spark Contexts (SparkSQL) Performance

2017-10-01 Thread Hammad
Hello, *Background:* I have Spark Streaming context; SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("TransformerStreamPOC"); conf.set("spark.driver.allowMultipleContexts", "true"); *<== this* JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(60));

Re: SparkSQL performance

2015-04-22 Thread Michael Armbrust
https://github.com/databricks/spark-avro On Tue, Apr 21, 2015 at 3:09 PM, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote: Thanks Michael! I have tried applying my schema programmatically but I didn't get any improvement on performance :( Could you point me to some code examples

Re: SparkSQL performance

2015-04-21 Thread Michael Armbrust
Here is an example using rows directly: https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema Avro or parquet input would likely give you the best performance. On Tue, Apr 21, 2015 at 4:28 AM, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com
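A small Java sketch of the "rows directly" approach from the linked guide, using the Spark 1.3-era DataFrame API; the column name attribute1 comes from Renato's query, but the second column and the sample values are made up.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SQLContext;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class ProgrammaticSchemaExample {
      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setMaster("local[2]").setAppName("schema-example"));
        SQLContext sqlContext = new SQLContext(sc);

        // Build Rows directly instead of going through JavaBeans.
        JavaRDD<Row> rows = sc.parallelize(Arrays.asList(
            RowFactory.create(1, 7.5), RowFactory.create(2, 3.0)));

        // Declare the schema programmatically.
        StructType schema = DataTypes.createStructType(Arrays.asList(
            DataTypes.createStructField("attribute1", DataTypes.IntegerType, false),
            DataTypes.createStructField("value", DataTypes.DoubleType, false)));

        DataFrame table = sqlContext.createDataFrame(rows, schema);
        table.registerTempTable("tableX");
        sqlContext.sql("SELECT * FROM tableX WHERE attribute1 BETWEEN 0 AND 5").show();
        sc.stop();
      }
    }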

Re: SparkSQL performance

2015-04-21 Thread Renato Marroquín Mogrovejo
Thanks Michael! I have tried applying my schema programmatically but I didn't get any improvement on performance :( Could you point me to some code examples using Avro please? Many thanks again! Renato M. 2015-04-21 20:45 GMT+02:00 Michael Armbrust mich...@databricks.com: Here is an example

Re: SparkSQL performance

2015-04-21 Thread Renato Marroquín Mogrovejo
Thanks for the hints guys, much appreciated! Even if I just do something like "Select * from tableX where attribute1 < 5" I see similar behaviour. @Michael Could you point me to any sample code that uses Spark's Rows? We are at a phase where we can actually change our JavaBeans for something

Re: SparkSQL performance

2015-04-20 Thread Renato Marroquín Mogrovejo
Does anybody have an idea? A clue? A hint? Thanks! Renato M. 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com: Hi all, I have a simple query "Select * from tableX where attribute1 between 0 and 5" that I run over a Kryo file with four partitions that ends up

Re: SparkSQL performance

2015-04-20 Thread ayan guha
SparkSQL optimizes better primarily through column pruning and predicate pushdown. Here you are not taking advantage of either. I am curious to know what goes into your filter function, as you are not using a filter on the SQL side. Best, Ayan On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo
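For illustration, a hedged Java sketch of what taking advantage of both would look like. It assumes a Parquet copy of the table (the Kryo file in this thread supports neither pushdown nor pruning), uses the 1.4+ DataFrameReader API, and the path is hypothetical.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class PushdownExample {
      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setMaster("local[2]").setAppName("pushdown-example"));
        SQLContext sqlContext = new SQLContext(sc);

        // Hypothetical Parquet copy of tableX.
        DataFrame df = sqlContext.read().parquet("/tmp/tableX.parquet");

        DataFrame pruned = df
            .select("attribute1")                      // column pruning: only this column is read
            .filter("attribute1 BETWEEN 0 AND 5");     // predicate can be pushed into the Parquet scan

        pruned.explain(true);                          // inspect the physical plan for the pushed filter
        System.out.println(pruned.count());
        sc.stop();
      }
    }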

Re: SparkSQL performance

2015-04-20 Thread Michael Armbrust
There is a cost to converting from JavaBeans to Rows and this code path has not been optimized. That is likely what you are seeing. On Mon, Apr 20, 2015 at 3:55 PM, ayan guha guha.a...@gmail.com wrote: SparkSQL optimizes better by column pruning and predicate pushdown, primarily. Here you are

SparkSQL performance

2015-04-20 Thread Renato Marroquín Mogrovejo
Hi all, I have a simple query "Select * from tableX where attribute1 between 0 and 5" that I run over a Kryo file with four partitions that ends up being around 3.5 million rows in our case. If I run this query by doing a simple map().filter() it takes around 9.6 seconds, but when I apply a schema,
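To make the comparison concrete, a hedged Java sketch of the two code paths being compared, using a small in-memory JavaBean RDD as a stand-in for the Kryo file; the bean, its single field, and the generated data are illustrative only.

    import java.io.Serializable;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class RddVersusSql {
      public static class Record implements Serializable {
        private int attribute1;
        public Record() {}
        public Record(int attribute1) { this.attribute1 = attribute1; }
        public int getAttribute1() { return attribute1; }
        public void setAttribute1(int attribute1) { this.attribute1 = attribute1; }
      }

      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setMaster("local[4]").setAppName("rdd-vs-sql"));
        List<Record> data = new ArrayList<>();
        for (int i = 0; i < 1000; i++) data.add(new Record(i % 10));
        JavaRDD<Record> rdd = sc.parallelize(data, 4);

        // Path 1: plain RDD filter, no schema involved.
        long viaRdd = rdd.filter(r -> r.getAttribute1() >= 0 && r.getAttribute1() <= 5).count();

        // Path 2: apply a schema from the JavaBean and run the same predicate in SQL.
        SQLContext sqlContext = new SQLContext(sc);
        DataFrame table = sqlContext.createDataFrame(rdd, Record.class);
        table.registerTempTable("tableX");
        long viaSql = sqlContext.sql(
            "SELECT * FROM tableX WHERE attribute1 BETWEEN 0 AND 5").count();

        System.out.println(viaRdd + " / " + viaSql);
        sc.stop();
      }
    }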

SparkSQL Performance Tuning Options

2015-01-27 Thread Manoj Samel
Spark 1.2, no Hive; prefer not to use HiveContext, to avoid metastore_db. The use case is a Spark-on-YARN app that starts and then serves as a query server for multiple users, i.e. always up and running. At startup, there is an option to cache data and also pre-compute some result sets, hash maps, etc. that would be
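A hedged Java sketch of that warm-up idea, written against the plain SQLContext (so no Hive metastore); for readability it uses the DataFrame API from slightly later 1.x releases rather than the Spark 1.2 Java bindings, and the table and path names are hypothetical.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class QueryServerWarmup {
      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("always-on-query-server"));
        SQLContext sqlContext = new SQLContext(sc);   // plain SQLContext, so no metastore_db

        // Load and cache the base data once, at startup.
        DataFrame facts = sqlContext.read().parquet("/data/facts.parquet");
        facts.registerTempTable("facts");
        sqlContext.cacheTable("facts");

        // Pre-compute a result set and keep it cached for later user queries.
        DataFrame daily = sqlContext.sql("SELECT day, count(*) AS n FROM facts GROUP BY day");
        daily.registerTempTable("daily_counts");
        sqlContext.cacheTable("daily_counts");

        // ...the long-running app now answers user queries against the cached tables.
      }
    }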

Re: SparkSQL Performance Tuning Options

2015-01-27 Thread Cheng Lian
On 1/27/15 5:55 PM, Cheng Lian wrote: On 1/27/15 11:38 AM, Manoj Samel wrote: Spark 1.2, no Hive; prefer not to use HiveContext, to avoid metastore_db. The use case is a Spark-on-YARN app that starts and then serves as a query server for multiple users, i.e. always up and running. At startup, there is an option

Re: SparkSQL performance

2014-11-03 Thread Marius Soutier
I was really surprised to see the results here, esp. SparkSQL not completing: http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style I was under the impression that SparkSQL performs really well because it can optimize the RDD operations and load only

SparkSQL performance

2014-10-31 Thread Soumya Simanta
I was really surprised to see the results here, esp. SparkSQL not completing http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style I was under the impression that SparkSQL performs really well because it can optimize the RDD operations and load only the columns that are required.

Re: SparkSQL performance

2014-10-31 Thread Du Li
From: Soumya Simanta <soumya.sima...@gmail.com> Date: Friday, October 31, 2014 at 4:04 PM To: user@spark.apache.org Subject: SparkSQL performance I was really surprised to see the results

Re: SparkSQL performance

2014-10-31 Thread Soumya Simanta
I was really surprised to see the results here, esp. SparkSQL not completing: http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style I was under the impression that SparkSQL performs really well because it can optimize the RDD operations