> Data is stored by Hive with ORC format.
>
> I executed a very simple SQL query: "SELECT * from table_name"
> The issue is that for some small tables (even tables with a few dozen
> records), SparkSQL still required about 7-8 seconds to finish, while Drill
> and Presto only needed less than 1 second.
> For other large tables with billions of records, SparkSQL performance was
> reasonable when it required 20-30 seconds to scan the whole table.

To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low
when compared to Drill or Presto

You are right. There were too many tasks created. How can we reduce the
number of tasks?
On Thu, Mar 29, 2018, 7:44 AM Lalwani,
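A minimal sketch of the knobs that usually control this, assuming a Spark 2.x
setup like the one described in this thread (the table name small_table is
made up):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReduceTaskCount {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("ReduceTaskCount")
        // Fewer shuffle partitions: the default of 200 creates 200 tasks
        // per shuffle stage even when the table is tiny.
        .config("spark.sql.shuffle.partitions", "8")
        .enableHiveSupport()
        .getOrCreate();

    Dataset<Row> rows = spark.sql("SELECT * FROM small_table");

    // For a plain scan (no shuffle), the task count comes from the input
    // splits; coalesce() collapses many small splits into a handful of
    // tasks without triggering a shuffle.
    rows.coalesce(1).show();
  }
}

A plain SELECT * has no shuffle, so coalesce() is what reduces the scan
tasks; the shuffle-partitions setting only matters once joins or
aggregations appear in the query.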
From: Tin Vu <tvu...@ucr.edu>
Date: Wednesday, March 28, 2018 at 8:04 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when
compared to Drill or Presto
Hi,

I am executing a benchmark
Data is stored by Hive with ORC format.

I executed a very simple SQL query: "SELECT * from table_name"
The issue is that for some small tables (even tables with a few dozen
records), SparkSQL still required about 7-8 seconds to finish, while Drill
and Presto only needed less than 1 second.
For other large tables with billions of records, SparkSQL performance was
reasonable when it required 20-30 seconds to scan the whole table.
Do you have any idea or reasonable explanation for this issue?
Thanks,
Hammad,

The recommended way to implement this logic would be to:

1. Create a SparkSession.
2. Create a Streaming Context using the SparkContext embedded in the
SparkSession.
3. Use the single SparkSession instance for the SQL operations within the
foreachRDD (a sketch follows the original code below).

It's important to note that Spark operations
Hello,

*Background:*

I have a Spark Streaming context:

SparkConf conf = new SparkConf()
    .setMaster("local[2]")
    .setAppName("TransformerStreamPOC");
conf.set("spark.driver.allowMultipleContexts", "true");  // <== this
JavaStreamingContext jssc =
    new JavaStreamingContext(conf, Durations.seconds(60));
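A minimal sketch of the structure recommended above, assuming Spark 2.x;
the socket source, port, and view name are illustrative, and no
allowMultipleContexts setting is needed:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class TransformerStreamPOC {
  public static void main(String[] args) throws InterruptedException {
    // 1. Create a single SparkSession.
    SparkSession spark = SparkSession.builder()
        .master("local[2]")
        .appName("TransformerStreamPOC")
        .getOrCreate();

    // 2. Build the streaming context from the session's embedded
    //    SparkContext instead of creating a second context.
    JavaStreamingContext jssc = new JavaStreamingContext(
        new JavaSparkContext(spark.sparkContext()),
        Durations.seconds(60));

    JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

    // 3. Reuse the same SparkSession for SQL work inside foreachRDD;
    //    this closure runs on the driver, so capturing the session is safe.
    lines.foreachRDD(rdd -> {
      spark.createDataset(rdd.rdd(), Encoders.STRING())
           .createOrReplaceTempView("events");
      spark.sql("SELECT count(*) FROM events").show();
    });

    jssc.start();
    jssc.awaitTermination();
  }
}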
https://github.com/databricks/spark-avro
On Tue, Apr 21, 2015 at 3:09 PM, Renato Marroquín Mogrovejo
renatoj.marroq...@gmail.com wrote:
Thanks Michael!
I have tried applying my schema programmatically but I didn't get any
improvement on performance :(
Could you point me to some code examples using Avro please?
Many thanks again!

Renato M.
Here is an example using rows directly:
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema
Avro or parquet input would likely give you the best performance.
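For reference, a minimal sketch of the Row-based pattern that link
describes, written against the Spark 1.3-era Java API (the file path,
field names, and table name are made up):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class RowSchemaExample {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[2]", "RowSchemaExample");
    SQLContext sqlContext = new SQLContext(sc);

    // Build Rows directly instead of going through JavaBeans.
    JavaRDD<Row> rows = sc.textFile("data.csv").map(line -> {
      String[] parts = line.split(",");
      return RowFactory.create(parts[0], Integer.parseInt(parts[1]));
    });

    // Describe the schema explicitly.
    List<StructField> fields = Arrays.asList(
        DataTypes.createStructField("name", DataTypes.StringType, false),
        DataTypes.createStructField("attribute1", DataTypes.IntegerType, false));
    StructType schema = DataTypes.createStructType(fields);

    DataFrame df = sqlContext.createDataFrame(rows, schema);
    df.registerTempTable("tableX");
    sqlContext.sql("SELECT * FROM tableX WHERE attribute1 BETWEEN 0 AND 5").show();
  }
}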
Thanks for the hints guys! Much appreciated!
Even if I just do something like:

Select * from tableX where attribute1 < 5

I see similar behaviour.
@Michael
Could you point me to any sample code that uses Spark's Rows? We are at a
phase where we can actually change our JavaBeans for something
Does anybody have an idea? a clue? a hint?
Thanks!
Renato M.
2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo
renatoj.marroq...@gmail.com:
SparkSQL optimizes better by column pruning and predicate pushdown,
primarily. Here you are not taking advantage of either.
I am curious to know what goes in your filter function, as you are not
using a filter on the SQL side.
Best
Ayan
On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo
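A minimal sketch of the contrast being drawn here, with a made-up bean and
table name; note that pruning and pushdown only pay off when the underlying
source supports them (e.g. Parquet), not for an in-memory collection like
this one:

import java.io.Serializable;
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class PushdownContrast {
  public static class Record implements Serializable {
    private int attribute1;
    private String payload;
    public Record() {}
    public Record(int a, String p) { attribute1 = a; payload = p; }
    public int getAttribute1() { return attribute1; }
    public void setAttribute1(int a) { attribute1 = a; }
    public String getPayload() { return payload; }
    public void setPayload(String p) { payload = p; }
  }

  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[2]", "PushdownContrast");
    SQLContext sqlContext = new SQLContext(sc);

    JavaRDD<Record> beans = sc.parallelize(Arrays.asList(
        new Record(3, "a"), new Record(42, "b")));

    // RDD style: every Record is fully deserialized, then filtered by a
    // JVM function the optimizer cannot see into.
    long viaRdd = beans.filter(r -> r.getAttribute1() >= 0
        && r.getAttribute1() <= 5).count();

    // SQL style: the optimizer can prune to attribute1 and push the
    // predicate into the scan when the source supports it.
    DataFrame df = sqlContext.createDataFrame(beans, Record.class);
    df.registerTempTable("tableX");
    long viaSql = sqlContext.sql(
        "SELECT attribute1 FROM tableX WHERE attribute1 BETWEEN 0 AND 5").count();

    System.out.println(viaRdd + " " + viaSql);
  }
}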
There is a cost to converting from JavaBeans to Rows and this code path has
not been optimized. That is likely what you are seeing.
On Mon, Apr 20, 2015 at 3:55 PM, ayan guha guha.a...@gmail.com wrote:
Hi all,
I have a simple query "Select * from tableX where attribute1 between 0 and
5" that I run over a Kryo file with four partitions that ends up being
around 3.5 million rows in our case.
If I run this query by doing a simple map().filter() it takes ~9.6
seconds but when I apply schema,
Spark 1.2, no Hive, prefer not to use HiveContext to avoid metastore_db.
The use case is a Spark YARN app that will start and serve as a query server
for multiple users, i.e. always up and running. At startup, there is an
option to cache data and also pre-compute some result sets, hash maps, etc.
that would be

On 1/27/15 5:55 PM, Cheng Lian wrote:
On 1/27/15 11:38 AM, Manoj Samel wrote:
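A minimal sketch of that startup flow, written against the Spark 1.4-era
Java API for illustration (the thread concerns 1.2, where the reader API
differs); paths and table names are made up:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class QueryServerStartup {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("yarn-client", "QueryServer");
    SQLContext sqlContext = new SQLContext(sc);  // no Hive, no metastore_db

    // At startup: load the data, register it, and pin it in memory so
    // later user queries hit the cache.
    DataFrame data = sqlContext.read().parquet("hdfs:///data/facts");
    data.registerTempTable("facts");
    sqlContext.cacheTable("facts");

    // Pre-compute a result set once; subsequent queries reuse the cache.
    DataFrame summary = sqlContext.sql(
        "SELECT key, count(*) AS cnt FROM facts GROUP BY key");
    summary.registerTempTable("facts_summary");
    sqlContext.cacheTable("facts_summary");
  }
}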
From: Soumya Simanta <soumya.sima...@gmail.com>
Date: Friday, October 31, 2014 at 4:04 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: SparkSQL performance

I was really surprised to see the results here, esp. SparkSQL not
completing:
http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
I was under the impression that SparkSQL performs really well because it
can optimize the RDD operations and load only the columns that are
required.