Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-31 Thread Tin Vu
Hi Gaurav,

Thank you for your response. Here are the answers to your questions:
1. Spark 2.3.0
2. I was using the 'spark-sql' command, for example: 'spark-sql --master
spark:/*:7077 --database tpcds_bin_partitioned_orc_100 -f $file_name', where
file_name is the file that contains the SQL script ("select * from table_name").
3. Hadoop 2.9.0

I am using the JDBC connector to Drill from the Hive Metastore. SparkSQL is
also connecting to the ORC database through Hive.

Thanks so much!

Tin

On Sat, Mar 31, 2018 at 11:41 AM, Gourav Sengupta <gourav.sengu...@gmail.com
> wrote:

> Hi Tin,
>
> This sounds interesting. While I would prefer to think that Presto and
> Drill have
>
> can you please provide the following details:
> 1. SPARK version
> 2. The exact code used in SPARK (the full code that was used)
> 3. HADOOP version
>
> I do think that SPARK and DRILL have complementary and different use
> cases. Have you tried using the JDBC connector to Drill from within SPARKSQL?
>
> Regards,
> Gourav Sengupta
>
>
> On Thu, Mar 29, 2018 at 1:03 AM, Tin Vu <tvu...@ucr.edu> wrote:
>
>> Hi,
>>
>> I am executing a benchmark to compare the performance of SparkSQL, Apache
>> Drill, and Presto. My experimental setup:
>>
>>- TPCDS dataset with scale factor 100 (size 100GB).
>>- Spark, Drill, and Presto have the same number of workers: 12.
>>- Each worker has the same allocated amount of memory: 4GB.
>>- Data is stored by Hive in ORC format.
>>
>> I executed a very simple SQL query: "SELECT * from table_name"
>> The issue is that for some small tables (even tables with a few dozen
>> records), SparkSQL still required about 7-8 seconds to finish, while
>> Drill and Presto needed less than 1 second.
>> For other large tables with billions of records, SparkSQL performance was
>> reasonable, requiring 20-30 seconds to scan the whole table.
>> Do you have any idea or reasonable explanation for this issue?
>>
>> Thanks,
>>
>>
>


Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-31 Thread Gourav Sengupta
Hi Tin,

This sounds interesting. While I would prefer to think that Presto and
Drill have

can you please provide the following details:
1. SPARK version
2. The exact code used in SPARK (the full code that was used)
3. HADOOP version

I do think that SPARK and DRILL have complementary and different use
cases. Have you tried using the JDBC connector to Drill from within SPARKSQL?
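
For reference, reading Drill from Spark over JDBC would look roughly like the
sketch below (untested; it assumes the Drill JDBC driver jar is on the Spark
classpath, and the URL, port, and table name are placeholders, not values from
this thread):

import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DrillViaJdbc {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("DrillViaJdbc")
        .getOrCreate();

    Properties props = new Properties();
    // Drill's JDBC driver class; the drill-jdbc-all jar must be on the classpath.
    props.setProperty("driver", "org.apache.drill.jdbc.Driver");

    // URL and table are placeholders -- point them at your Drill cluster and dataset.
    Dataset<Row> df = spark.read()
        .jdbc("jdbc:drill:drillbit=localhost:31010", "dfs.tmp.`some_table`", props);

    df.show();
    spark.stop();
  }
}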

Regards,
Gourav Sengupta


On Thu, Mar 29, 2018 at 1:03 AM, Tin Vu <tvu...@ucr.edu> wrote:

> Hi,
>
> I am executing a benchmark to compare the performance of SparkSQL, Apache
> Drill, and Presto. My experimental setup:
>
>- TPCDS dataset with scale factor 100 (size 100GB).
>- Spark, Drill, and Presto have the same number of workers: 12.
>- Each worker has the same allocated amount of memory: 4GB.
>- Data is stored by Hive in ORC format.
>
> I executed a very simple SQL query: "SELECT * from table_name"
> The issue is that for some small tables (even tables with a few dozen
> records), SparkSQL still required about 7-8 seconds to finish, while Drill
> and Presto needed less than 1 second.
> For other large tables with billions of records, SparkSQL performance was
> reasonable, requiring 20-30 seconds to scan the whole table.
> Do you have any idea or reasonable explanation for this issue?
>
> Thanks,
>
>


Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-29 Thread Lalwani, Jayesh
It depends on how you have loaded the data. Ideally, if you have dozens of
records, your input data should have them in one partition. If the input has 1
partition, and the data is small enough, Spark will keep it in one partition (as
far as possible).

If you cannot control your data, you need to repartition the data when you load
it. This will (eventually) cause a shuffle, and all the data will be moved into
the number of partitions that you specify. Subsequent operations will run on the
repartitioned dataframe and should take fewer tasks. Shuffle has costs
associated with it. You will need to make a call on whether you want to take the
upfront cost of a shuffle, or live with a large number of tasks.
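
For illustration, here is a minimal sketch of those two options (assuming
Spark 2.x with Hive support enabled; the table name is a placeholder):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReduceTaskCount {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("ReduceTaskCount")
        .enableHiveSupport()            // read ORC tables registered in the Hive metastore
        .getOrCreate();

    // Placeholder table name; any small dimension table would do.
    Dataset<Row> small = spark.table("tpcds_bin_partitioned_orc_100.date_dim");

    // Option 1: coalesce() lowers the partition (and therefore task) count without a shuffle.
    Dataset<Row> fewTasks = small.coalesce(1);
    fewTasks.show();

    // Option 2: repartition() pays an up-front shuffle, after which downstream
    // stages run on the smaller number of partitions you asked for.
    Dataset<Row> reshuffled = small.repartition(4);
    reshuffled.show();

    spark.stop();
  }
}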

From: Tin Vu <tvu...@ucr.edu>
Date: Thursday, March 29, 2018 at 10:47 AM
To: "Lalwani, Jayesh" <jayesh.lalw...@capitalone.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low 
when compared to Drill or Presto

 You are right. There were too many tasks created. How can we reduce the
number of tasks?

On Thu, Mar 29, 2018, 7:44 AM Lalwani, Jayesh
<jayesh.lalw...@capitalone.com> wrote:
Without knowing too many details, I can only guess. It could be that Spark is
creating a lot of tasks even though there are few records. Creation and
distribution of tasks has a noticeable overhead on smaller datasets.

You might want to look at the driver logs, or the Spark Application Detail UI.

From: Tin Vu <tvu...@ucr.edu>
Date: Wednesday, March 28, 2018 at 8:04 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when 
compared to Drill or Presto

Hi,

I am executing a benchmark to compare the performance of SparkSQL, Apache Drill,
and Presto. My experimental setup:
• TPCDS dataset with scale factor 100 (size 100GB).
• Spark, Drill, and Presto have the same number of workers: 12.
• Each worker has the same allocated amount of memory: 4GB.
• Data is stored by Hive in ORC format.

I executed a very simple SQL query: "SELECT * from table_name"
The issue is that for some small tables (even tables with a few dozen
records), SparkSQL still required about 7-8 seconds to finish, while Drill and
Presto needed less than 1 second.
For other large tables with billions of records, SparkSQL performance was
reasonable, requiring 20-30 seconds to scan the whole table.
Do you have any idea or reasonable explanation for this issue?

Thanks,






Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-29 Thread Tin Vu
 You are right. There were too many tasks created. How can we reduce the
number of tasks?

On Thu, Mar 29, 2018, 7:44 AM Lalwani, Jayesh <jayesh.lalw...@capitalone.com>
wrote:

> Without knowing too many details, I can only guess. It could be that Spark
> is creating a lot of tasks even though there are few records. Creation and
> distribution of tasks has a noticeable overhead on smaller datasets.
>
>
>
> You might want to look at the driver logs, or the Spark Application Detail
> UI.
>
>
>
> *From: *Tin Vu <tvu...@ucr.edu>
> *Date: *Wednesday, March 28, 2018 at 8:04 PM
> *To: *"user@spark.apache.org" <user@spark.apache.org>
> *Subject: *[SparkSQL] SparkSQL performance on small TPCDS tables is very
> low when compared to Drill or Presto
>
>
>
> Hi,
>
>
>
> I am executing a benchmark to compare the performance of SparkSQL, Apache
> Drill, and Presto. My experimental setup:
>
> · TPCDS dataset with scale factor 100 (size 100GB).
>
> · Spark, Drill, and Presto have the same number of workers: 12.
>
> · Each worker has the same allocated amount of memory: 4GB.
>
> · Data is stored by Hive in ORC format.
>
> I executed a very simple SQL query: "SELECT * from table_name"
> The issue is that for some small tables (even tables with a few dozen
> records), SparkSQL still required about 7-8 seconds to finish, while Drill
> and Presto needed less than 1 second.
> For other large tables with billions of records, SparkSQL performance was
> reasonable, requiring 20-30 seconds to scan the whole table.
> Do you have any idea or reasonable explanation for this issue?
>
> Thanks,
>
>
>
>


Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-29 Thread Lalwani, Jayesh
Without knowing too many details, I can only guess. It could be that Spark is 
creating a lot of tasks even though there are few records. Creation and 
distribution of tasks has a noticeable overhead on smaller datasets.

You might want to look at the driver logs, or the Spark Application Detail UI.
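
As a rough check from code, the partition count of the scanned table tells you
approximately how many tasks a full scan will launch (a sketch assuming
Spark 2.x; the table name is a placeholder):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TaskCountCheck {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("TaskCountCheck")
        .enableHiveSupport()
        .getOrCreate();

    // Placeholder table name.
    Dataset<Row> t = spark.table("tpcds_bin_partitioned_orc_100.date_dim");

    // One task is launched per partition of the underlying RDD,
    // so this number is roughly the task count of a full scan.
    System.out.println("partitions: " + t.rdd().getNumPartitions());

    spark.stop();
  }
}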

From: Tin Vu <tvu...@ucr.edu>
Date: Wednesday, March 28, 2018 at 8:04 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when 
compared to Drill or Presto

Hi,

I am executing a benchmark to compare the performance of SparkSQL, Apache Drill,
and Presto. My experimental setup:
· TPCDS dataset with scale factor 100 (size 100GB).
· Spark, Drill, and Presto have the same number of workers: 12.
· Each worker has the same allocated amount of memory: 4GB.
· Data is stored by Hive in ORC format.

I executed a very simple SQL query: "SELECT * from table_name"
The issue is that for some small tables (even tables with a few dozen
records), SparkSQL still required about 7-8 seconds to finish, while Drill and
Presto needed less than 1 second.
For other large tables with billions of records, SparkSQL performance was
reasonable, requiring 20-30 seconds to scan the whole table.
Do you have any idea or reasonable explanation for this issue?

Thanks,





Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Tin Vu
Thanks for your response.  What do you mean when you said "immediately
return"?

On Wed, Mar 28, 2018, 10:33 PM Jörn Franke <jornfra...@gmail.com> wrote:

> I don’t think select * is a good benchmark. You should do a more complex
> operation, otherwise optimizers might see that you don’t do anything in the
> query and immediately return (similarly count might immediately return by
> using some statistics).
>
> On 29. Mar 2018, at 02:03, Tin Vu <tvu...@ucr.edu> wrote:
>
> Hi,
>
> I am executing a benchmark to compare the performance of SparkSQL, Apache
> Drill, and Presto. My experimental setup:
>
>- TPCDS dataset with scale factor 100 (size 100GB).
>- Spark, Drill, and Presto have the same number of workers: 12.
>- Each worker has the same allocated amount of memory: 4GB.
>- Data is stored by Hive in ORC format.
>
> I executed a very simple SQL query: "SELECT * from table_name"
> The issue is that for some small tables (even tables with a few dozen
> records), SparkSQL still required about 7-8 seconds to finish, while Drill
> and Presto needed less than 1 second.
> For other large tables with billions of records, SparkSQL performance was
> reasonable, requiring 20-30 seconds to scan the whole table.
> Do you have any idea or reasonable explanation for this issue?
>
> Thanks,
>
>


Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Jörn Franke
I don’t think select * is a good benchmark. You should do a more complex 
operation, otherwise optimizers might see that you don’t do anything in the 
query and immediately return (similarly count might immediately return by using 
some statistics).
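
To illustrate the point, here is a sketch of a less trivial query over the same
data (the query below is made up for illustration and is not one of the
official TPCDS benchmark queries; assuming Spark 2.x with Hive support):

import org.apache.spark.sql.SparkSession;

public class NonTrivialScan {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("NonTrivialScan")
        .enableHiveSupport()
        .getOrCreate();

    // A join plus an aggregation forces the engine to actually read, shuffle, and
    // compute, so an optimizer cannot short-circuit it the way it might for
    // "SELECT *" or a statistics-backed COUNT(*).
    spark.sql(
        "SELECT i_category, SUM(ss_net_paid) AS total_paid " +
        "FROM store_sales JOIN item ON ss_item_sk = i_item_sk " +
        "WHERE ss_quantity > 10 " +
        "GROUP BY i_category").show();

    spark.stop();
  }
}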

> On 29. Mar 2018, at 02:03, Tin Vu <tvu...@ucr.edu> wrote:
> 
> Hi,
> 
> I am executing a benchmark to compare the performance of SparkSQL, Apache Drill 
> and Presto. My experimental setup:
> TPCDS dataset with scale factor 100 (size 100GB).
> Spark, Drill, and Presto have the same number of workers: 12.
> Each worker has the same allocated amount of memory: 4GB.
> Data is stored by Hive in ORC format.
> I executed a very simple SQL query: "SELECT * from table_name"
> The issue is that for some small tables (even tables with a few dozen 
> records), SparkSQL still required about 7-8 seconds to finish, while Drill 
> and Presto needed less than 1 second.
> For other large tables with billions of records, SparkSQL performance was 
> reasonable, requiring 20-30 seconds to scan the whole table.
> Do you have any idea or reasonable explanation for this issue?
> Thanks,
> 


[SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Tin Vu
Hi,

I am executing a benchmark to compare the performance of SparkSQL, Apache Drill,
and Presto. My experimental setup:

   - TPCDS dataset with scale factor 100 (size 100GB).
   - Spark, Drill, and Presto have the same number of workers: 12.
   - Each worker has the same allocated amount of memory: 4GB.
   - Data is stored by Hive in ORC format.

I executed a very simple SQL query: "SELECT * from table_name"
The issue is that for some small tables (even tables with a few dozen
records), SparkSQL still required about 7-8 seconds to finish, while Drill
and Presto needed less than 1 second.
For other large tables with billions of records, SparkSQL performance was
reasonable, requiring 20-30 seconds to scan the whole table.
Do you have any idea or reasonable explanation for this issue?

Thanks,


Re: Spark Streaming - Multiple Spark Contexts (SparkSQL) Performance

2017-10-01 Thread Gerard Maas
Hammad,

The recommended way to implement this logic would be to:

- Create a SparkSession.
- Create a StreamingContext using the SparkContext embedded in the
SparkSession.
- Use the single SparkSession instance for the SQL operations within the
foreachRDD.

It's important to note that Spark operations can process the complete
dataset. In this case, there's no need to do a per-partition or per-element
operation. (That would be the case if we were directly using the driver's
API and DB connections.)

Reorganizing the code in the question a bit, we should have:

SparkSession sparkSession = SparkSession
    .builder()
    .master("local[2]").appName("TransformerStreamPOC")
    .config("spark.some.config.option", "some-value")
    .getOrCreate();

// JavaStreamingContext expects a JavaSparkContext, so wrap the session's SparkContext
JavaStreamingContext jssc = new JavaStreamingContext(
    JavaSparkContext.fromSparkContext(sparkSession.sparkContext()),
    Durations.seconds(60));

// this dataset doesn't seem to depend on the received data, so we can
// load it once
Dataset<Row> baselineData =
    sparkSession.read().jdbc(MYSQL_CONNECTION_URL, "table_name",
        connectionProperties);

// create dstream

JavaDStream dstream = ...

... operations on dstream...

dstream.foreachRDD(rdd -> {

    // plus an Encoder (or a bean class via createDataFrame) for the element type
    Dataset incomingData = sparkSession.createDataset(rdd)

    ... do something with the incoming dataset, e.g. join with the baseline ...

    Dataset<Row> joined = incomingData.join(baselineData, ...);

    ... do something with joined ...

});


kr, Gerard.

On Sun, Oct 1, 2017 at 7:55 PM, Hammad  wrote:

> Hello,
>
> *Background:*
>
> I have Spark Streaming context;
>
> SparkConf conf = new 
> SparkConf().setMaster("local[2]").setAppName("TransformerStreamPOC");
> conf.set("spark.driver.allowMultipleContexts", "true");   *<== this*
> JavaStreamingContext jssc = new JavaStreamingContext(conf, 
> Durations.seconds(60));
>
>
> that subscribes to certain kafka *topics*;
>
> JavaInputDStream<ConsumerRecord<String, String>> stream =
> KafkaUtils.createDirectStream(
> jssc,
> LocationStrategies.PreferConsistent(),
> ConsumerStrategies.Subscribe(*topics*, 
> kafkaParams)
> );
>
> when messages arrive in queue, I recursively process them as follows (below 
> code section will repeat in Question statement)
>
> stream.foreachRDD(rdd -> {
> //process here - below two scenarions code is inserted here
>
> });
>
>
> *Question starts here:*
>
> Since I need to apply SparkSQL to received events in Queue - I create 
> SparkSession with two scenarios;
>
> *1) Per partition one sparkSession (after 
> "spark.driver.allowMultipleContexts" set to true); so all events under this 
> partition are handled by same sparkSession*
>
> rdd.foreachPartition(partition -> {
> SparkSession sparkSession = SparkSession
> .builder()
> .appName("Java Spark SQL basic example")
> .config("spark.some.config.option", "some-value")
> .getOrCreate();
>
> while (partition.hasNext()) {
>   Dataset<Row> df = sparkSession.read().jdbc(MYSQL_CONNECTION_URL, 
> "table_name", connectionProperties);
>
> }}
>
> *2) Per event under each session; so each event under each queue under each 
> stream has one sparkSession;*
>
> rdd.foreachPartition(partition -> {while (partition.hasNext()) {
> SparkSession sparkSession = SparkSession.builder().appName("Java Spark SQL 
> basic example").config("spark.some.config.option", 
> "some-value").getOrCreate();
>
> Dataset<Row> df = sparkSession.read().jdbc(MYSQL_CONNECTION_URL, 
> "table_name", connectionProperties);
>
> }}
>
>
> Is it good practice to create multiple contexts (let's say 10 or 100)?
> How does the number of SparkContexts allowed relate to the number of worker
> nodes?
> What are performance considerations with respect to scenario1 and
> scenario2?
>
> I am looking for these answers as I feel there is more to what I
> understand of performance w.r.t sparkContexts created by a streaming
> application.
> Really appreciate your support in anticipation.
>
> Hammad
>
>


Fwd: Spark Streaming - Multiple Spark Contexts (SparkSQL) Performance

2017-10-01 Thread Hammad
Hello,

*Background:*

I have Spark Streaming context;

SparkConf conf = new
SparkConf().setMaster("local[2]").setAppName("TransformerStreamPOC");
conf.set("spark.driver.allowMultipleContexts", "true");   *<== this*
JavaStreamingContext jssc = new JavaStreamingContext(conf,
Durations.seconds(60));


that subscribes to certain kafka *topics*;

JavaInputDStream<ConsumerRecord<String, String>> stream =
KafkaUtils.createDirectStream(
jssc,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.Subscribe(*topics*,
kafkaParams)
);

when messages arrive in queue, I recursively process them as follows
(below code section will repeat in Question statement)

stream.foreachRDD(rdd -> {
//process here - below two scenarions code is inserted here

});


*Question starts here:*

Since I need to apply SparkSQL to received events in Queue - I create
SparkSession with two scenarios;

*1) Per partition one sparkSession (after
"spark.driver.allowMultipleContexts" set to true); so all events under
this partition are handled by same sparkSession*

rdd.foreachPartition(partition -> {
SparkSession sparkSession = SparkSession
.builder()
.appName("Java Spark SQL basic example")
.config("spark.some.config.option", "some-value")
.getOrCreate();

while (partition.hasNext()) {
  Dataset<Row> df = sparkSession.read().jdbc(MYSQL_CONNECTION_URL,
"table_name", connectionProperties);

}}

*2) Per event under each session; so each event under each queue under
each stream has one sparkSession;*

rdd.foreachPartition(partition -> {while (partition.hasNext()) {
 SparkSession sparkSession = SparkSession.builder().appName("Java
Spark SQL basic example").config("spark.some.config.option",
"some-value").getOrCreate();

Dataset<Row> df = sparkSession.read().jdbc(MYSQL_CONNECTION_URL,
"table_name", connectionProperties);

}}


Is it good practice to create multiple contexts (let's say 10 or 100)?
How does the number of SparkContexts allowed relate to the number of worker
nodes?
What are performance considerations with respect to scenario1 and scenario2?

I am looking for these answers as I feel there is more to what I understand
of performance w.r.t sparkContexts created by a streaming application.
Really appreciate your support in anticipation.

Hammad


Re: SparkSQL performance

2015-04-22 Thread Michael Armbrust
https://github.com/databricks/spark-avro

On Tue, Apr 21, 2015 at 3:09 PM, Renato Marroquín Mogrovejo 
renatoj.marroq...@gmail.com wrote:

 Thanks Michael!
 I have tried applying my schema programmatically but I didn't get any
 improvement in performance :(
 Could you point me to some code examples using Avro please?
 Many thanks again!


 Renato M.

 2015-04-21 20:45 GMT+02:00 Michael Armbrust mich...@databricks.com:

 Here is an example using rows directly:

 https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema

 Avro or parquet input would likely give you the best performance.

 On Tue, Apr 21, 2015 at 4:28 AM, Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com wrote:

 Thanks for the hints guys! Much appreciated!
 Even if I just do something like:

 Select * from tableX where attribute1 < 5

 I see similar behaviour.

 @Michael
 Could you point me to any sample code that uses Spark's Rows? We are at a
 phase where we can actually change our JavaBeans for something that
 provides better performance than what we are seeing now. Would you
 recommend using the Avro representation then?
 Thanks again!


 Renato M.

 2015-04-21 1:18 GMT+02:00 Michael Armbrust mich...@databricks.com:

 There is a cost to converting from JavaBeans to Rows and this code path
 has not been optimized.  That is likely what you are seeing.

 On Mon, Apr 20, 2015 at 3:55 PM, ayan guha guha.a...@gmail.com wrote:

 SparkSQL optimizes better by column pruning and predicate pushdown,
 primarily. Here you are not taking advantage of either.

 I am curious to know what goes in your filter function, as you are not
 using a filter in SQL side.

 Best
 Ayan
 On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com wrote:

 Does anybody have an idea? a clue? a hint?
 Thanks!


 Renato M.

 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com:

 Hi all,

 I have a simple query "Select * from tableX where attribute1 between 0 and 5"
 that I run over a Kryo file with four partitions that ends up being around
 3.5 million rows in our case.
 If I run this query by doing a simple map().filter() it takes around ~9.6
 seconds, but when I apply a schema, register the table into a SqlContext, and
 then run the query, it takes around ~16 seconds. This is using Spark 1.2.1
 with Scala 2.10.0.
 I am wondering why there is such a big gap in performance if it is just a
 filter. Internally, the relation files are mapped to a JavaBean. Could this
 different data representation (JavaBeans vs. SparkSQL's internal
 representation) lead to such a difference? Is there anything I could do to
 make the performance get closer to the hard-coded option?
 Thanks in advance for any suggestions or ideas.


 Renato M.









Re: SparkSQL performance

2015-04-21 Thread Michael Armbrust
Here is an example using rows directly:
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema

Avro or parquet input would likely give you the best performance.
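
For reference, a minimal sketch of the row-based approach the linked guide
describes, assuming the Spark 1.3-era Java API (the column name simply mirrors
the query discussed in this thread; the sample data is made up):

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class RowsInsteadOfBeans {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[2]", "RowsInsteadOfBeans");
    SQLContext sqlContext = new SQLContext(sc);

    // Build Rows directly instead of converting JavaBeans.
    JavaRDD<Row> rows = sc.parallelize(Arrays.asList(1, 3, 7, 42))
        .map(v -> RowFactory.create(v));

    // Declare the schema programmatically.
    StructType schema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("attribute1", DataTypes.IntegerType, false)));

    DataFrame df = sqlContext.createDataFrame(rows, schema);
    df.registerTempTable("tableX");

    sqlContext.sql("SELECT * FROM tableX WHERE attribute1 BETWEEN 0 AND 5").show();
    sc.stop();
  }
}

Working on Rows plus an explicit schema avoids the JavaBean-to-Row conversion
cost mentioned earlier in this thread.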

On Tue, Apr 21, 2015 at 4:28 AM, Renato Marroquín Mogrovejo 
renatoj.marroq...@gmail.com wrote:

 Thanks for the hints guys! Much appreciated!
 Even if I just do something like:

 Select * from tableX where attribute1 < 5

 I see similar behaviour.

 @Michael
 Could you point me to any sample code that uses Spark's Rows? We are at a
 phase where we can actually change our JavaBeans for something that
 provides better performance than what we are seeing now. Would you
 recommend using the Avro representation then?
 Thanks again!


 Renato M.

 2015-04-21 1:18 GMT+02:00 Michael Armbrust mich...@databricks.com:

 There is a cost to converting from JavaBeans to Rows and this code path
 has not been optimized.  That is likely what you are seeing.

 On Mon, Apr 20, 2015 at 3:55 PM, ayan guha guha.a...@gmail.com wrote:

 SparkSQL optimizes better by column pruning and predicate pushdown,
 primarily. Here you are not taking advantage of either.

 I am curious to know what goes in your filter function, as you are not
 using a filter in SQL side.

 Best
 Ayan
 On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com wrote:

 Does anybody have an idea? a clue? a hint?
 Thanks!


 Renato M.

 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com:

 Hi all,

 I have a simple query "Select * from tableX where attribute1 between 0 and 5"
 that I run over a Kryo file with four partitions that ends up being around
 3.5 million rows in our case.
 If I run this query by doing a simple map().filter() it takes around ~9.6
 seconds, but when I apply a schema, register the table into a SqlContext, and
 then run the query, it takes around ~16 seconds. This is using Spark 1.2.1
 with Scala 2.10.0.
 I am wondering why there is such a big gap in performance if it is just a
 filter. Internally, the relation files are mapped to a JavaBean. Could this
 different data representation (JavaBeans vs. SparkSQL's internal
 representation) lead to such a difference? Is there anything I could do to
 make the performance get closer to the hard-coded option?
 Thanks in advance for any suggestions or ideas.


 Renato M.







Re: SparkSQL performance

2015-04-21 Thread Renato Marroquín Mogrovejo
Thanks Michael!
I have tried applying my schema programmatically but I didn't get any
improvement in performance :(
Could you point me to some code examples using Avro please?
Many thanks again!


Renato M.

2015-04-21 20:45 GMT+02:00 Michael Armbrust mich...@databricks.com:

 Here is an example using rows directly:

 https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema

 Avro or parquet input would likely give you the best performance.

 On Tue, Apr 21, 2015 at 4:28 AM, Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com wrote:

 Thanks for the hints guys! Much appreciated!
 Even if I just do something like:

 Select * from tableX where attribute1 < 5

 I see similar behaviour.

 @Michael
 Could you point me to any sample code that uses Spark's Rows? We are at a
 phase where we can actually change our JavaBeans for something that
 provides better performance than what we are seeing now. Would you
 recommend using the Avro representation then?
 Thanks again!


 Renato M.

 2015-04-21 1:18 GMT+02:00 Michael Armbrust mich...@databricks.com:

 There is a cost to converting from JavaBeans to Rows and this code path
 has not been optimized.  That is likely what you are seeing.

 On Mon, Apr 20, 2015 at 3:55 PM, ayan guha guha.a...@gmail.com wrote:

 SparkSQL optimizes better by column pruning and predicate pushdown,
 primarily. Here you are not taking advantage of either.

 I am curious to know what goes in your filter function, as you are not
 using a filter in SQL side.

 Best
 Ayan
 On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com wrote:

 Does anybody have an idea? a clue? a hint?
 Thanks!


 Renato M.

 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com:

 Hi all,

 I have a simple query "Select * from tableX where attribute1 between 0 and 5"
 that I run over a Kryo file with four partitions that ends up being around
 3.5 million rows in our case.
 If I run this query by doing a simple map().filter() it takes around ~9.6
 seconds, but when I apply a schema, register the table into a SqlContext, and
 then run the query, it takes around ~16 seconds. This is using Spark 1.2.1
 with Scala 2.10.0.
 I am wondering why there is such a big gap in performance if it is just a
 filter. Internally, the relation files are mapped to a JavaBean. Could this
 different data representation (JavaBeans vs. SparkSQL's internal
 representation) lead to such a difference? Is there anything I could do to
 make the performance get closer to the hard-coded option?
 Thanks in advance for any suggestions or ideas.


 Renato M.








Re: SparkSQL performance

2015-04-21 Thread Renato Marroquín Mogrovejo
Thanks for the hints guys! Much appreciated!
Even if I just do something like:

Select * from tableX where attribute1 < 5

I see similar behaviour.

@Michael
Could you point me to any sample code that uses Spark's Rows? We are at a
phase where we can actually change our JavaBeans for something that
provides better performance than what we are seeing now. Would you
recommend using the Avro representation then?
Thanks again!


Renato M.

2015-04-21 1:18 GMT+02:00 Michael Armbrust mich...@databricks.com:

 There is a cost to converting from JavaBeans to Rows and this code path
 has not been optimized.  That is likely what you are seeing.

 On Mon, Apr 20, 2015 at 3:55 PM, ayan guha guha.a...@gmail.com wrote:

 SparkSQL optimizes better by column pruning and predicate pushdown,
 primarily. Here you are not taking advantage of either.

 I am curious to know what goes in your filter function, as you are not
 using a filter in SQL side.

 Best
 Ayan
 On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com wrote:

 Does anybody have an idea? a clue? a hint?
 Thanks!


 Renato M.

 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com:

 Hi all,

 I have a simple query "Select * from tableX where attribute1 between 0 and 5"
 that I run over a Kryo file with four partitions that ends up being around
 3.5 million rows in our case.
 If I run this query by doing a simple map().filter() it takes around ~9.6
 seconds, but when I apply a schema, register the table into a SqlContext, and
 then run the query, it takes around ~16 seconds. This is using Spark 1.2.1
 with Scala 2.10.0.
 I am wondering why there is such a big gap in performance if it is just a
 filter. Internally, the relation files are mapped to a JavaBean. Could this
 different data representation (JavaBeans vs. SparkSQL's internal
 representation) lead to such a difference? Is there anything I could do to
 make the performance get closer to the hard-coded option?
 Thanks in advance for any suggestions or ideas.


 Renato M.






Re: SparkSQL performance

2015-04-20 Thread Renato Marroquín Mogrovejo
Does anybody have an idea? a clue? a hint?
Thanks!


Renato M.

2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo 
renatoj.marroq...@gmail.com:

 Hi all,

 I have a simple query "Select * from tableX where attribute1 between 0 and 5"
 that I run over a Kryo file with four partitions that ends up being around
 3.5 million rows in our case.
 If I run this query by doing a simple map().filter() it takes around ~9.6
 seconds, but when I apply a schema, register the table into a SqlContext, and
 then run the query, it takes around ~16 seconds. This is using Spark 1.2.1
 with Scala 2.10.0.
 I am wondering why there is such a big gap in performance if it is just a
 filter. Internally, the relation files are mapped to a JavaBean. Could this
 different data representation (JavaBeans vs. SparkSQL's internal
 representation) lead to such a difference? Is there anything I could do to
 make the performance get closer to the hard-coded option?
 Thanks in advance for any suggestions or ideas.


 Renato M.



Re: SparkSQL performance

2015-04-20 Thread ayan guha
SparkSQL optimizes better by column pruning and predicate pushdown,
primarily. Here you are not taking advantage of either.

I am curious to know what goes in your filter function, as you are not
using a filter in SQL side.

Best
Ayan
On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo 
renatoj.marroq...@gmail.com wrote:

 Does anybody have an idea? a clue? a hint?
 Thanks!


 Renato M.

 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com:

 Hi all,

 I have a simple query "Select * from tableX where attribute1 between 0 and 5"
 that I run over a Kryo file with four partitions that ends up being around
 3.5 million rows in our case.
 If I run this query by doing a simple map().filter() it takes around ~9.6
 seconds, but when I apply a schema, register the table into a SqlContext, and
 then run the query, it takes around ~16 seconds. This is using Spark 1.2.1
 with Scala 2.10.0.
 I am wondering why there is such a big gap in performance if it is just a
 filter. Internally, the relation files are mapped to a JavaBean. Could this
 different data representation (JavaBeans vs. SparkSQL's internal
 representation) lead to such a difference? Is there anything I could do to
 make the performance get closer to the hard-coded option?
 Thanks in advance for any suggestions or ideas.


 Renato M.





Re: SparkSQL performance

2015-04-20 Thread Michael Armbrust
There is a cost to converting from JavaBeans to Rows and this code path has
not been optimized.  That is likely what you are seeing.

On Mon, Apr 20, 2015 at 3:55 PM, ayan guha guha.a...@gmail.com wrote:

 SparkSQL optimizes better by column pruning and predicate pushdown,
 primarily. Here you are not taking advantage of either.

 I am curious to know what goes in your filter function, as you are not
 using a filter in SQL side.

 Best
 Ayan
 On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com wrote:

 Does anybody have an idea? a clue? a hint?
 Thanks!


 Renato M.

 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com:

 Hi all,

 I have a simple query "Select * from tableX where attribute1 between 0 and 5"
 that I run over a Kryo file with four partitions that ends up being around
 3.5 million rows in our case.
 If I run this query by doing a simple map().filter() it takes around ~9.6
 seconds, but when I apply a schema, register the table into a SqlContext, and
 then run the query, it takes around ~16 seconds. This is using Spark 1.2.1
 with Scala 2.10.0.
 I am wondering why there is such a big gap in performance if it is just a
 filter. Internally, the relation files are mapped to a JavaBean. Could this
 different data representation (JavaBeans vs. SparkSQL's internal
 representation) lead to such a difference? Is there anything I could do to
 make the performance get closer to the hard-coded option?
 Thanks in advance for any suggestions or ideas.


 Renato M.





SparkSQL performance

2015-04-20 Thread Renato Marroquín Mogrovejo
Hi all,

I have a simple query "Select * from tableX where attribute1 between 0 and 5"
that I run over a Kryo file with four partitions that ends up being around
3.5 million rows in our case.
If I run this query by doing a simple map().filter() it takes around ~9.6
seconds, but when I apply a schema, register the table into a SqlContext, and
then run the query, it takes around ~16 seconds. This is using Spark 1.2.1
with Scala 2.10.0.
I am wondering why there is such a big gap in performance if it is just a
filter. Internally, the relation files are mapped to a JavaBean. Could this
different data representation (JavaBeans vs. SparkSQL's internal representation)
lead to such a difference? Is there anything I could do to make the
performance get closer to the hard-coded option?
Thanks in advance for any suggestions or ideas.


Renato M.


SparkSQL Performance Tuning Options

2015-01-27 Thread Manoj Samel
Spark 1.2, no Hive, prefer not to use HiveContext to avoid metastore_db.

The use case is a Spark YARN app that will start and serve as a query server for
multiple users, i.e. always up and running. At startup, there is an option to
cache data and also pre-compute some result sets, hash maps, etc. that
would likely be asked for by client APIs. I.e. there is some option to use
startup time to precompute/cache - but the query response time requirement on
a large data set is very stringent.

Hoping to use SparkSQL (but a combination of SQL and RDD APIs is also OK).

* Does SparkSQL execution use underlying partition information? (Data is
from HDFS)
* Are there any ways to give hints to the SparkSQL execution about any
precomputed/pre-cached RDDs?
* Packages spark.sql.execution, spark.sql.execution.joins and other sql.xxx
packages - is using these for tuning the query plan recommended? I would
like to keep this as-needed if possible.
* Features not in current release but scheduled for upcoming release would
also be good to know.

Thanks,

PS: This is not a small topic so if someone prefers to start an offline
thread on details, I can do that and summarize the conclusions back to this
thread.


Re: SparkSQL Performance Tuning Options

2015-01-27 Thread Cheng Lian


On 1/27/15 5:55 PM, Cheng Lian wrote:


On 1/27/15 11:38 AM, Manoj Samel wrote:

Spark 1.2, no Hive, prefer not to use HiveContext to avoid metastore_db.

The use case is a Spark YARN app that will start and serve as a query server 
for multiple users, i.e. always up and running. At startup, there is an 
option to cache data and also pre-compute some result sets, hash 
maps, etc. that would likely be asked for by client APIs. I.e. there is 
some option to use startup time to precompute/cache - but the query 
response time requirement on a large data set is very stringent.


Hoping to use SparkSQL (but a combination of SQL and RDD APIs is also 
OK).


* Does SparkSQL execution use underlying partition information? 
(Data is from HDFS)
No. For example, if the underlying data has already been partitioned 
by some key, Spark SQL doesn't know it, and can't leverage that 
information to avoid shuffle when doing aggregation on that key. 
However, partitioning the data ahead of time does help minimize 
shuffle network IO. There's a JIRA ticket to make Spark SQL aware of the 
underlying data distribution.


Maybe you are asking about locality? If that's the case, just want to 
add that Spark SQL does understand locality information of the 
underlying data. It's obtained from Hadoop InputFormat.


* Are there any ways to give hints to the SparkSQL execution about 
any precomputed/pre-cached RDDs?
Instead of caching the raw RDD, it's recommended to transform the raw RDD to a 
SchemaRDD and then cache it, so that the in-memory columnar storage can be 
used. Also, Spark SQL recognizes cached SchemaRDDs automatically (see the 
sketch at the end of this message).
* Packages spark.sql.execution, spark.sql.execution.joins and other 
sql.xxx packages - is using these for tuning the query plan 
recommended? I would like to keep this as-needed if possible.
Not sure whether I understood this question. Are you trying to use 
internal APIs to do customized optimizations?
* Features not in current release but scheduled for upcoming release 
would also be good to know.


Thanks,

PS: This is not a small topic so if someone prefers to start an 
offline thread on details, I can do that and summarize the 
conclusions back to this thread.
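
An illustrative sketch of the caching recommendation above: it uses the
Spark 1.3+ DataFrame API (cacheTable) rather than the 1.2 SchemaRDD API
discussed here, and the input path and table name are placeholders:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class CachedTableSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[2]", "CachedTableSketch");
    SQLContext sqlContext = new SQLContext(sc);

    // Register the precomputed data as a table (the path is a placeholder)...
    DataFrame precomputed = sqlContext.parquetFile("hdfs:///warehouse/precomputed");
    precomputed.registerTempTable("precomputed");

    // ...and cache it through Spark SQL, so the in-memory columnar format is used
    // and subsequent queries against the table hit the cache automatically.
    sqlContext.cacheTable("precomputed");

    sqlContext.sql("SELECT COUNT(*) FROM precomputed").show();
    sc.stop();
  }
}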











Re: SparkSQL performance

2014-11-03 Thread Marius Soutier
I did some simple experiments with Impala and Spark, and Impala came out ahead. 
But it’s also less flexible, couldn’t handle irregular schemas, didn't support 
JSON, and so on.

On 01.11.2014, at 02:20, Soumya Simanta soumya.sima...@gmail.com wrote:

 I agree. My personal experience with Spark core is that it performs really 
 well once you tune it properly. 
 
 As far as I understand, SparkSQL under the hood performs many of these 
 optimizations (order of Spark operations) and uses a more efficient storage 
 format. Is this assumption correct? 
 
 Has anyone done any comparison of SparkSQL with Impala ? The fact that many 
 of the queries don't even finish in the benchmark is quite surprising and 
 hard to believe. 
 
 A few months ago there were a few emails about Spark not being able to handle 
 large volumes (TBs) of data. That myth was busted recently when the folks at 
 Databricks published their sorting record results. 
  
 
 Thanks
 -Soumya
 
 
 
 
  
 
 On Fri, Oct 31, 2014 at 7:35 PM, Du Li l...@yahoo-inc.com wrote:
 We have seen all kinds of results published that often contradict each other. 
 My take is that the authors often know more tricks about how to tune their 
 own/familiar products than the others. So the product on focus is tuned for 
 ideal performance while the competitors are not. The authors are not 
 necessarily biased but as a consequence the results are.
 
 Ideally it’s critical for the user community to be informed of all the 
 in-depth tuning tricks of all products. However, realistically, there is a 
 big gap in terms of documentation. Hope the Spark folks will make a 
 difference. :-)
 
 Du
 
 
 From: Soumya Simanta soumya.sima...@gmail.com
 Date: Friday, October 31, 2014 at 4:04 PM
 To: user@spark.apache.org user@spark.apache.org
 Subject: SparkSQL performance
 
 I was really surprised to see the results here, esp. SparkSQL not completing
 http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
 
 I was under the impression that SparkSQL performs really well because it can 
 optimize the RDD operations and load only the columns that are required. This 
 essentially means in most cases SparkSQL should be as fast as Spark is. 
 
 I would be very interested to hear what others in the group have to say about 
 this. 
 
 Thanks
 -Soumya
 
 
 



SparkSQL performance

2014-10-31 Thread Soumya Simanta
I was really surprised to see the results here, esp. SparkSQL not
completing
http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style

I was under the impression that SparkSQL performs really well because it
can optimize the RDD operations and load only the columns that are
required. This essentially means in most cases SparkSQL should be as fast
as Spark is.

I would be very interested to hear what others in the group have to say
about this.

Thanks
-Soumya


Re: SparkSQL performance

2014-10-31 Thread Du Li
We have seen all kinds of results published that often contradict each other. 
My take is that the authors often know more tricks about how to tune their 
own/familiar products than the others. So the product on focus is tuned for 
ideal performance while the competitors are not. The authors are not 
necessarily biased but as a consequence the results are.

Ideally it’s critical for the user community to be informed of all the in-depth 
tuning tricks of all products. However, realistically, there is a big gap in 
terms of documentation. Hope the Spark folks will make a difference. :-)

Du


From: Soumya Simanta soumya.sima...@gmail.com
Date: Friday, October 31, 2014 at 4:04 PM
To: user@spark.apache.org user@spark.apache.org
Subject: SparkSQL performance

I was really surprised to see the results here, esp. SparkSQL not completing
http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style

I was under the impression that SparkSQL performs really well because it can 
optimize the RDD operations and load only the columns that are required. This 
essentially means in most cases SparkSQL should be as fast as Spark is.

I would be very interested to hear what others in the group have to say about 
this.

Thanks
-Soumya




Re: SparkSQL performance

2014-10-31 Thread Soumya Simanta
I agree. My personal experience with Spark core is that it performs really
well once you tune it properly.

As far as I understand, SparkSQL under the hood performs many of these
optimizations (order of Spark operations) and uses a more efficient storage
format. Is this assumption correct?

Has anyone done any comparison of SparkSQL with Impala ? The fact that many
of the queries don't even finish in the benchmark is quite surprising and
hard to believe.

A few months ago there were a few emails about Spark not being able to
handle large volumes (TBs) of data. That myth was busted recently when the
folks at Databricks published their sorting record results.


Thanks
-Soumya






On Fri, Oct 31, 2014 at 7:35 PM, Du Li l...@yahoo-inc.com wrote:

   We have seen all kinds of results published that often contradict each
 other. My take is that the authors often know more tricks about how to tune
 their own/familiar products than the others. So the product on focus is
 tuned for ideal performance while the competitors are not. The authors are
 not necessarily biased but as a consequence the results are.

  Ideally it’s critical for the user community to be informed of all the
 in-depth tuning tricks of all products. However, realistically, there is a
 big gap in terms of documentation. Hope the Spark folks will make a
 difference. :-)

  Du


   From: Soumya Simanta soumya.sima...@gmail.com
 Date: Friday, October 31, 2014 at 4:04 PM
 To: user@spark.apache.org user@spark.apache.org
 Subject: SparkSQL performance

   I was really surprised to see the results here, esp. SparkSQL not
 completing
 http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style

  I was under the impression that SparkSQL performs really well because it
 can optimize the RDD operations and load only the columns that are
 required. This essentially means in most cases SparkSQL should be as fast
 as Spark is.

  I would be very interested to hear what others in the group have to say
 about this.

  Thanks
 -Soumya