Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto
Hi Gourav,

Thank you for your response. Here are the answers to your questions:

1. Spark 2.3.0
2. I was using the 'spark-sql' command, for example:
   'spark-sql --master spark:/*:7077 --database tpcds_bin_partitioned_orc_100 -f $file_name'
   where file_name is the file that contains the SQL script ("select * from table_name").
3. Hadoop 2.9.0

I am using the JDBC connector to Drill from the Hive Metastore. SparkSQL also connects to the ORC database through Hive.

Thanks so much!

Tin

On Sat, Mar 31, 2018 at 11:41 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi Tin,
>
> This sounds interesting. While I would prefer to think that Presto and Drill have
>
> can you please provide the following details:
> 1. SPARK version
> 2. The exact code used in SPARK (the full code that was used)
> 3. HADOOP version
>
> I do think that SPARK and DRILL have complementary and different use cases. Have you tried using the JDBC connector to Drill from within SPARKSQL?
>
> Regards,
> Gourav Sengupta
Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto
Hi Tin,

This sounds interesting. While I would prefer to think that Presto and Drill have

can you please provide the following details:
1. SPARK version
2. The exact code used in SPARK (the full code that was used)
3. HADOOP version

I do think that SPARK and DRILL have complementary and different use cases. Have you tried using the JDBC connector to Drill from within SPARKSQL?

Regards,
Gourav Sengupta

On Thu, Mar 29, 2018 at 1:03 AM, Tin Vu <tvu...@ucr.edu> wrote:

> Hi,
>
> I am running a benchmark to compare the performance of SparkSQL, Apache Drill and Presto. My experimental setup:
>
> - TPCDS dataset with scale factor 100 (size 100GB).
> - Spark, Drill and Presto have the same number of workers: 12.
> - Each worker has the same amount of allocated memory: 4GB.
> - Data is stored by Hive in ORC format.
>
> I executed a very simple SQL query: "SELECT * from table_name"
> The issue is that for some small tables (even tables with a few dozen records), SparkSQL still required about 7-8 seconds to finish, while Drill and Presto needed less than 1 second.
> For other large tables with billions of records, SparkSQL performance was reasonable: it required 20-30 seconds to scan the whole table.
> Do you have any idea or reasonable explanation for this issue?
>
> Thanks,
Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto
It depends on how you have loaded the data. Ideally, if you have dozens of records, your input data should have them in one partition. If the input has 1 partition and the data is small enough, Spark will keep it in one partition (as far as possible).

If you cannot control your data, you need to repartition the data when you load it. This will (eventually) cause a shuffle, and all the data will be moved into the number of partitions that you specify. Subsequent operations will run on the repartitioned dataframe and should take fewer tasks. A shuffle has costs associated with it. You will need to make a call whether you want to take the upfront cost of a shuffle or live with a large number of tasks.

From: Tin Vu <tvu...@ucr.edu>
Date: Thursday, March 29, 2018 at 10:47 AM
To: "Lalwani, Jayesh" <jayesh.lalw...@capitalone.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

You are right. Too many tasks were created. How can we reduce the number of tasks?

The information contained in this e-mail is confidential and/or proprietary to Capital One and/or its affiliates and may only be used solely in performance of work or services for Capital One. The information transmitted herewith is intended only for use by the individual or entity to which it is addressed. If the reader of this message is not the intended recipient, you are hereby notified that any review, retransmission, dissemination, distribution, copying or other use of, or taking of any action in reliance upon this information is strictly prohibited. If you have received this communication in error, please contact the sender and delete the material from your computer.
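The trade-off described above (few partitions for small inputs, many for large ones) can be made concrete with a little arithmetic. The sketch below is plain Java, not Spark code: the class `PartitionSizing`, the method `targetPartitions` and the 128 MB target per partition are all assumptions made up for illustration.

```java
// Back-of-envelope partition sizing; plain Java, not a Spark API.
// The 128 MB target per partition is an assumption chosen for illustration.
class PartitionSizing {
    static final long TARGET_PARTITION_BYTES = 128L * 1024 * 1024;

    // Returns how many partitions an input of the given size would need so that
    // each holds at most TARGET_PARTITION_BYTES; never fewer than one.
    static int targetPartitions(long inputBytes) {
        if (inputBytes <= 0) {
            return 1;
        }
        long parts = (inputBytes + TARGET_PARTITION_BYTES - 1) / TARGET_PARTITION_BYTES;
        return (int) Math.min(parts, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        long tinyTable = 4L * 1024;                 // stands in for a few dozen records
        long bigTable = 100L * 1024 * 1024 * 1024;  // stands in for the 100GB dataset
        System.out.println("tiny table -> " + targetPartitions(tinyTable) + " partition(s)");
        System.out.println("big table  -> " + targetPartitions(bigTable) + " partition(s)");
    }
}
```

With these assumed sizes the tiny table maps to 1 partition and the 100GB input to 800. In Spark, a number like this would be what you pass to repartition(n), which shuffles, or to coalesce(n), which reduces the partition count without a full shuffle.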
Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto
You are right. Too many tasks were created. How can we reduce the number of tasks?

On Thu, Mar 29, 2018, 7:44 AM Lalwani, Jayesh <jayesh.lalw...@capitalone.com> wrote:

> Without knowing too many details, I can only guess. It could be that Spark is creating a lot of tasks even though there are few records. Creation and distribution of tasks has a noticeable overhead on smaller datasets.
>
> You might want to look at the driver logs, or the Spark Application Detail UI.
Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto
Without knowing too many details, I can only guess. It could be that Spark is creating a lot of tasks even though there are few records. Creation and distribution of tasks has a noticeable overhead on smaller datasets.

You might want to look at the driver logs, or the Spark Application Detail UI.

From: Tin Vu <tvu...@ucr.edu>
Date: Wednesday, March 28, 2018 at 8:04 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

Hi,

I am running a benchmark to compare the performance of SparkSQL, Apache Drill and Presto. My experimental setup:

- TPCDS dataset with scale factor 100 (size 100GB).
- Spark, Drill and Presto have the same number of workers: 12.
- Each worker has the same amount of allocated memory: 4GB.
- Data is stored by Hive in ORC format.

I executed a very simple SQL query: "SELECT * from table_name"
The issue is that for some small tables (even tables with a few dozen records), SparkSQL still required about 7-8 seconds to finish, while Drill and Presto needed less than 1 second.
For other large tables with billions of records, SparkSQL performance was reasonable: it required 20-30 seconds to scan the whole table.
Do you have any idea or reasonable explanation for this issue?

Thanks,
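To see why task creation can dominate on a tiny table, here is a toy cost model in plain Java. It is only a sketch: `TaskOverheadModel`, the 100 ms per-task overhead and the per-row cost are invented values, not measurements from Spark or from this benchmark.

```java
// Toy cost model: fixed per-task overhead plus per-row work.
// All constants are made-up values for illustration, not measurements.
class TaskOverheadModel {
    // Estimated wall-clock milliseconds for a job that runs `tasks` tasks on
    // `slots` parallel task slots and processes `rows` rows in total.
    // Tasks in the same wave run in parallel, so a wave costs one task's time.
    static double estimateMillis(int tasks, int slots, long rows,
                                 double perTaskOverheadMs, double perRowMs) {
        int waves = (tasks + slots - 1) / slots;   // scheduling rounds, rounded up
        double rowsPerTask = (double) rows / tasks;
        return waves * (perTaskOverheadMs + rowsPerTask * perRowMs);
    }

    public static void main(String[] args) {
        // 50 rows spread over 200 tasks versus the same 50 rows in a single task,
        // both on 12 slots (the thread mentions 12 workers).
        System.out.println("200 tasks: " + estimateMillis(200, 12, 50, 100.0, 0.001) + " ms");
        System.out.println("1 task:    " + estimateMillis(1, 12, 50, 100.0, 0.001) + " ms");
    }
}
```

Under these assumptions the 200-task job needs 17 scheduling waves and spends almost all of its time on overhead, while the single-task job pays the overhead once. The driver logs or the Spark UI would show the real task count for a given query.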
Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto
Thanks for your response. What did you mean by "immediately return"?

On Wed, Mar 28, 2018, 10:33 PM Jörn Franke <jornfra...@gmail.com> wrote:

> I don’t think select * is a good benchmark. You should do a more complex operation; otherwise optimizers might see that you don’t do anything in the query and immediately return (similarly, count might immediately return by using some statistics).
Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto
I don’t think select * is a good benchmark. You should do a more complex operation; otherwise optimizers might see that you don’t do anything in the query and immediately return (similarly, count might immediately return by using some statistics).

> On 29. Mar 2018, at 02:03, Tin Vu <tvu...@ucr.edu> wrote:
>
> Hi,
>
> I am running a benchmark to compare the performance of SparkSQL, Apache Drill and Presto. My experimental setup:
>
> - TPCDS dataset with scale factor 100 (size 100GB).
> - Spark, Drill and Presto have the same number of workers: 12.
> - Each worker has the same amount of allocated memory: 4GB.
> - Data is stored by Hive in ORC format.
>
> I executed a very simple SQL query: "SELECT * from table_name"
> The issue is that for some small tables (even tables with a few dozen records), SparkSQL still required about 7-8 seconds to finish, while Drill and Presto needed less than 1 second.
> For other large tables with billions of records, SparkSQL performance was reasonable: it required 20-30 seconds to scan the whole table.
> Do you have any idea or reasonable explanation for this issue?
>
> Thanks,
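One way to picture the "immediately return" case for count is a table that keeps its row count as metadata. The sketch below is a toy in plain Java with hypothetical names (`StatsTable`, `countFromStats`, `countByScan`); it does not describe how any particular engine is implemented. ORC files do store row counts and column statistics in their metadata, but whether an engine answers a query from them depends on the engine.

```java
import java.util.ArrayList;
import java.util.List;

// Toy table that records a row count as metadata when rows are written.
// Hypothetical class for illustration only.
class StatsTable {
    private final List<String> rows = new ArrayList<>();
    private long rowCountStat = 0;   // maintained on write, like a stored statistic
    long rowsScanned = 0;            // how many rows queries have actually read

    void insert(String row) {
        rows.add(row);
        rowCountStat++;
    }

    // COUNT(*) answered from metadata: touches no rows.
    long countFromStats() {
        return rowCountStat;
    }

    // COUNT(*) answered by reading every row.
    long countByScan() {
        long n = 0;
        for (String ignored : rows) {
            rowsScanned++;
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        StatsTable t = new StatsTable();
        for (int i = 0; i < 1000; i++) {
            t.insert("row-" + i);
        }
        System.out.println(t.countFromStats() + " rows, rows scanned so far: " + t.rowsScanned);
        System.out.println(t.countByScan() + " rows, rows scanned so far: " + t.rowsScanned);
    }
}
```

Both calls return the same answer, but only the second one reads the data. A benchmark built on such a query may end up timing a metadata lookup in one engine and a full scan in another.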
[SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto
Hi,

I am running a benchmark to compare the performance of SparkSQL, Apache Drill and Presto. My experimental setup:

- TPCDS dataset with scale factor 100 (size 100GB).
- Spark, Drill and Presto have the same number of workers: 12.
- Each worker has the same amount of allocated memory: 4GB.
- Data is stored by Hive in ORC format.

I executed a very simple SQL query: "SELECT * from table_name"
The issue is that for some small tables (even tables with a few dozen records), SparkSQL still required about 7-8 seconds to finish, while Drill and Presto needed less than 1 second.
For other large tables with billions of records, SparkSQL performance was reasonable: it required 20-30 seconds to scan the whole table.
Do you have any idea or reasonable explanation for this issue?

Thanks,
Re: Spark Streaming - Multiple Spark Contexts (SparkSQL) Performance
Hammad,

The recommended way to implement this logic would be to:

- Create a SparkSession.
- Create a StreamingContext using the SparkContext embedded in the SparkSession.
- Use the single SparkSession instance for the SQL operations within the foreachRDD.

It's important to note that Spark operations can process the complete dataset. In this case, there's no need to do a per-partition or per-element operation (that would be the case if we were directly using the driver's API and DB connections).

Reorganizing the code in the question a bit, we should have:

    SparkSession sparkSession = SparkSession
        .builder()
        .master("local[2]")
        .appName("TransformerStreamPOC")
        .config("spark.some.config.option", "some-value")
        .getOrCreate();

    // JavaStreamingContext takes a JavaSparkContext, so wrap the session's SparkContext
    JavaStreamingContext jssc = new JavaStreamingContext(
        new JavaSparkContext(sparkSession.sparkContext()), Durations.seconds(60));

    // this dataset doesn't seem to depend on the received data, so we can load it once.
    Dataset<Row> baselineData = sparkSession.read().jdbc(MYSQL_CONNECTION_URL, "table_name", connectionProperties);

    // create dstream (the element type String is only assumed here)
    JavaDStream<String> dstream = ...
    // ... operations on dstream ...

    dstream.foreachRDD(rdd -> {
        Dataset<String> incomingData = sparkSession.createDataset(rdd.rdd(), Encoders.STRING());
        // ... do something with the incoming dataset, e.g. join with the baseline ...
        Dataset<Row> joined = incomingData.join(baselineData, ...);
        // ... do something with joined ...
    });

kr, Gerard.

On Sun, Oct 1, 2017 at 7:55 PM, Hammad wrote:

> Hello,
>
> *Background:*
>
> I have a Spark Streaming context;
>
>     SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("TransformerStreamPOC");
>     conf.set("spark.driver.allowMultipleContexts", "true");   // <== this
>     JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(60));
>
> that subscribes to certain Kafka *topics*;
>
>     JavaInputDStream<ConsumerRecord<String, String>> stream =
>         KafkaUtils.createDirectStream(
>             jssc,
>             LocationStrategies.PreferConsistent(),
>             ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
>         );
>
> When messages arrive in the queue, I process them as follows (the code below is repeated in the question):
>
>     stream.foreachRDD(rdd -> {
>         // process here - the code for the two scenarios below is inserted here
>     });
>
> *Question starts here:*
>
> Since I need to apply SparkSQL to the events received in the queue, I create a SparkSession in one of two ways;
>
> *1) One sparkSession per partition (after "spark.driver.allowMultipleContexts" is set to true); so all events in this partition are handled by the same sparkSession*
>
>     rdd.foreachPartition(partition -> {
>         SparkSession sparkSession = SparkSession
>             .builder()
>             .appName("Java Spark SQL basic example")
>             .config("spark.some.config.option", "some-value")
>             .getOrCreate();
>
>         while (partition.hasNext()) {
>             partition.next();   // consume the event
>             Dataset<Row> df = sparkSession.read().jdbc(MYSQL_CONNECTION_URL, "table_name", connectionProperties);
>         }
>     });
>
> *2) One sparkSession per event; so each event in each queue in each stream has its own sparkSession;*
>
>     rdd.foreachPartition(partition -> {
>         while (partition.hasNext()) {
>             partition.next();   // consume the event
>             SparkSession sparkSession = SparkSession.builder().appName("Java Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate();
>
>             Dataset<Row> df = sparkSession.read().jdbc(MYSQL_CONNECTION_URL, "table_name", connectionProperties);
>         }
>     });
>
> Is it good practice to create multiple contexts (let's say 10 or 100)?
> How does the number of sparkContexts allowed relate to the number of worker nodes?
> What are the performance considerations with respect to scenario 1 and scenario 2?
>
> I am asking because I feel there is more to the performance implications of the sparkContexts created by a streaming application than I currently understand.
> I would really appreciate your support.
>
> Hammad
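The advice to use one shared session rather than one per partition or per event can be illustrated without Spark at all. The following is a toy in plain Java: `ToySession` is a hypothetical stand-in that only mimics the get-or-create idea (return the existing instance if there is one, build it otherwise); it is not how SparkSession is implemented.

```java
// Toy stand-in for an expensive, shareable session object.
// `ToySession` is hypothetical; it only mimics the get-or-create idea.
class ToySession {
    static int constructed = 0;          // how many sessions were really built
    private static ToySession active;    // the one shared instance

    private ToySession() {
        constructed++;
    }

    // Returns the existing session if there is one, otherwise builds it once.
    static synchronized ToySession getOrCreate() {
        if (active == null) {
            active = new ToySession();
        }
        return active;
    }

    public static void main(String[] args) {
        for (int batch = 0; batch < 3; batch++) {
            for (int event = 0; event < 100; event++) {
                ToySession.getOrCreate();   // requested per event, still one instance
            }
        }
        System.out.println("sessions constructed: " + constructed);
    }
}
```

Asking for the session 300 times still constructs it once, which is why building it up front and passing the single instance into foreachRDD loses nothing and keeps the per-event path cheap.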
Fwd: Spark Streaming - Multiple Spark Contexts (SparkSQL) Performance
Hello,

*Background:*

I have a Spark Streaming context;

    SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("TransformerStreamPOC");
    conf.set("spark.driver.allowMultipleContexts", "true");   // <== this
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(60));

that subscribes to certain Kafka *topics*;

    JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
            jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
        );

When messages arrive in the queue, I process them as follows (the code below is repeated in the question):

    stream.foreachRDD(rdd -> {
        // process here - the code for the two scenarios below is inserted here
    });

*Question starts here:*

Since I need to apply SparkSQL to the events received in the queue, I create a SparkSession in one of two ways;

*1) One sparkSession per partition (after "spark.driver.allowMultipleContexts" is set to true); so all events in this partition are handled by the same sparkSession*

    rdd.foreachPartition(partition -> {
        SparkSession sparkSession = SparkSession
            .builder()
            .appName("Java Spark SQL basic example")
            .config("spark.some.config.option", "some-value")
            .getOrCreate();

        while (partition.hasNext()) {
            partition.next();   // consume the event
            Dataset<Row> df = sparkSession.read().jdbc(MYSQL_CONNECTION_URL, "table_name", connectionProperties);
        }
    });

*2) One sparkSession per event; so each event in each queue in each stream has its own sparkSession;*

    rdd.foreachPartition(partition -> {
        while (partition.hasNext()) {
            partition.next();   // consume the event
            SparkSession sparkSession = SparkSession.builder().appName("Java Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate();

            Dataset<Row> df = sparkSession.read().jdbc(MYSQL_CONNECTION_URL, "table_name", connectionProperties);
        }
    });

Is it good practice to create multiple contexts (let's say 10 or 100)?
How does the number of sparkContexts allowed relate to the number of worker nodes?
What are the performance considerations with respect to scenario 1 and scenario 2?

I am asking because I feel there is more to the performance implications of the sparkContexts created by a streaming application than I currently understand.
I would really appreciate your support.

Hammad
Re: SparkSQL performance
https://github.com/databricks/spark-avro

On Tue, Apr 21, 2015 at 3:09 PM, Renato Marroquín Mogrovejo <renatoj.marroq...@gmail.com> wrote:

> Thanks Michael!
> I have tried applying my schema programmatically but I didn't get any improvement in performance :(
> Could you point me to some code examples using Avro please?
> Many thanks again!
>
> Renato M.
Re: SparkSQL performance
Here is an example using rows directly:
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema

Avro or parquet input would likely give you the best performance.

On Tue, Apr 21, 2015 at 4:28 AM, Renato Marroquín Mogrovejo <renatoj.marroq...@gmail.com> wrote:

> Thanks for the hints guys! Much appreciated!
> Even if I just do something like:
>
> Select * from tableX where attribute1 5
>
> I see similar behaviour.
>
> @Michael
> Could you point me to any sample code that uses Spark's Rows? We are at a phase where we can actually change our JavaBeans for something that provides better performance than what we are seeing now. Would you recommend using the Avro representation then?
> Thanks again!
>
> Renato M.
Re: SparkSQL performance
Thanks Michael!
I have tried applying my schema programmatically but I didn't get any improvement in performance :(
Could you point me to some code examples using Avro please?
Many thanks again!

Renato M.

2015-04-21 20:45 GMT+02:00 Michael Armbrust <mich...@databricks.com>:

> Here is an example using rows directly:
> https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema
>
> Avro or parquet input would likely give you the best performance.
Re: SparkSQL performance
Thanks for the hints guys! Much appreciated!
Even if I just do something like:

Select * from tableX where attribute1 5

I see similar behaviour.

@Michael
Could you point me to any sample code that uses Spark's Rows? We are at a phase where we can actually change our JavaBeans for something that provides better performance than what we are seeing now. Would you recommend using the Avro representation then?
Thanks again!

Renato M.

2015-04-21 1:18 GMT+02:00 Michael Armbrust <mich...@databricks.com>:

> There is a cost to converting from JavaBeans to Rows and this code path has not been optimized. That is likely what you are seeing.
Re: SparkSQL performance
Does anybody have an idea? A clue? A hint?
Thanks!

Renato M.

2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo <renatoj.marroq...@gmail.com>:

> Hi all,
>
> I have a simple query
>
> Select * from tableX where attribute1 between 0 and 5
>
> that I run over a Kryo file with four partitions that ends up being around 3.5 million rows in our case.
> If I run this query by doing a simple map().filter() it takes around 9.6 seconds, but when I apply a schema, register the table into a SqlContext, and then run the query, it takes around 16 seconds. This is using Spark 1.2.1 with Scala 2.10.0.
> I am wondering why there is such a big gap in performance if it is just a filter. Internally, the relation files are mapped to a JavaBean. Could this different data representation (JavaBeans vs SparkSQL internal representation) lead to such a difference? Is there anything I could do to bring the performance closer to the hard-coded option?
> Thanks in advance for any suggestions or ideas.
>
> Renato M.
Re: SparkSQL performance
SparkSQL optimizes better by column pruning and predicate pushdown, primarily. Here you are not taking advantage of either. I am curious to know what goes in your filter function, as you are not using a filter on the SQL side.

Best
Ayan

On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote:
> Does anybody have an idea? A clue? A hint? Thanks!
>
> Renato M.
>
> 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com:
>> Hi all,
>>
>> I have a simple query, Select * from tableX where attribute1 between 0 and 5, that I run over a Kryo file with four partitions that ends up being around 3.5 million rows in our case. If I run this query by doing a simple map().filter(), it takes around 9.6 seconds, but when I apply a schema, register the table into a SqlContext, and then run the query, it takes around 16 seconds. This is using Spark 1.2.1 with Scala 2.10.0.
>>
>> I am wondering why there is such a big gap in performance if it is just a filter. Internally, the relation files are mapped to a JavaBean. Could this different data representation (JavaBeans vs. SparkSQL internal representation) lead to such a difference? Is there anything I could do to bring the performance closer to the hard-coded option?
>>
>> Thanks in advance for any suggestions or ideas.
>>
>> Renato M.
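For readers unfamiliar with the two optimizations Ayan names, here is a minimal sketch in plain Python. It is a toy model, not Spark code: the `scan` function, the `in_range` helper and the three-column table are hypothetical, and the only assumption taken from the thread is the `attribute1 between 0 and 5` predicate. It shows that a narrow projection reads fewer columns, while a `Select *` query reads all of them, so pruning has nothing to skip.

```python
# Toy model of column pruning and predicate pushdown on a columnar table.
# Not Spark code: `scan` and the table layout are hypothetical illustrations.

def scan(table, columns, predicate_column, predicate):
    """Return (rows, columns_read) for a filtered projection of `table`."""
    # Predicate pushdown: evaluate the filter on its own column first,
    # so non-matching rows are never materialized.
    keep = [i for i, value in enumerate(table[predicate_column]) if predicate(value)]
    # Column pruning: read only the columns the query asks for.
    rows = [tuple(table[c][i] for c in columns) for i in keep]
    columns_read = {predicate_column, *columns}
    return rows, columns_read


table = {
    "attribute1": [0, 3, 7, 5, 9],
    "attribute2": ["a", "b", "c", "d", "e"],
    "attribute3": [10, 20, 30, 40, 50],
}


def in_range(value):
    # Same predicate as the query in the thread: attribute1 between 0 and 5
    return 0 <= value <= 5


# A narrow projection touches two of the three columns.
narrow_rows, narrow_read = scan(table, ["attribute2"], "attribute1", in_range)

# SELECT * touches every column, so pruning has nothing to skip.
star_rows, star_read = scan(table, list(table), "attribute1", in_range)

print(narrow_rows, sorted(narrow_read))
print(star_rows, sorted(star_read))
```

In this toy model the filter still helps the `Select *` case by skipping non-matching rows, but every column of every matching row has to be read.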
Re: SparkSQL performance
There is a cost to converting from JavaBeans to Rows, and this code path has not been optimized. That is likely what you are seeing.

On Mon, Apr 20, 2015 at 3:55 PM, ayan guha guha.a...@gmail.com wrote:
> SparkSQL optimizes better by column pruning and predicate pushdown, primarily. Here you are not taking advantage of either. I am curious to know what goes in your filter function, as you are not using a filter on the SQL side.
>
> Best
> Ayan
>
> On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote:
>> Does anybody have an idea? A clue? A hint? Thanks!
>>
>> Renato M.
>>
>> 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com:
>>> Hi all,
>>>
>>> I have a simple query, Select * from tableX where attribute1 between 0 and 5, that I run over a Kryo file with four partitions that ends up being around 3.5 million rows in our case. If I run this query by doing a simple map().filter(), it takes around 9.6 seconds, but when I apply a schema, register the table into a SqlContext, and then run the query, it takes around 16 seconds. This is using Spark 1.2.1 with Scala 2.10.0.
>>>
>>> I am wondering why there is such a big gap in performance if it is just a filter. Internally, the relation files are mapped to a JavaBean. Could this different data representation (JavaBeans vs. SparkSQL internal representation) lead to such a difference? Is there anything I could do to bring the performance closer to the hard-coded option?
>>>
>>> Thanks in advance for any suggestions or ideas.
>>>
>>> Renato M.
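The conversion cost Michael describes can be pictured with a small plain-Python sketch. It is only an analogy under stated assumptions: `TableXBean`, `FIELDS` and `bean_to_row` are hypothetical stand-ins, not Spark's actual JavaBean-to-Row code path. The point is that the schema path converts every record, including the ones the filter later discards, while the hard-coded path filters the objects directly.

```python
# Analogy for the JavaBean -> Row conversion overhead (not Spark's code path).

class TableXBean:
    """Hypothetical stand-in for the JavaBean the relation files are mapped to."""

    def __init__(self, attribute1, attribute2):
        self.attribute1 = attribute1
        self.attribute2 = attribute2


FIELDS = ("attribute1", "attribute2")


def bean_to_row(bean):
    # Generic field-by-field conversion, done once per record.
    return tuple(getattr(bean, name) for name in FIELDS)


beans = [TableXBean(i % 10, "v%d" % i) for i in range(1000)]

# Hard-coded path: filter the objects directly, no conversion at all.
direct = [b for b in beans if 0 <= b.attribute1 <= 5]

# Schema path: convert every record to a row first, then filter the rows.
rows = [bean_to_row(b) for b in beans]
filtered_rows = [r for r in rows if 0 <= r[0] <= 5]

print(len(direct), len(rows), len(filtered_rows))
```

Both paths keep the same 600 records, but the schema path performs 1000 conversions to get there, which is extra per-record work on top of the filter itself.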
SparkSQL performance
Hi all,

I have a simple query, Select * from tableX where attribute1 between 0 and 5, that I run over a Kryo file with four partitions that ends up being around 3.5 million rows in our case. If I run this query by doing a simple map().filter(), it takes around 9.6 seconds, but when I apply a schema, register the table into a SqlContext, and then run the query, it takes around 16 seconds. This is using Spark 1.2.1 with Scala 2.10.0.

I am wondering why there is such a big gap in performance if it is just a filter. Internally, the relation files are mapped to a JavaBean. Could this different data representation (JavaBeans vs. SparkSQL internal representation) lead to such a difference? Is there anything I could do to bring the performance closer to the hard-coded option?

Thanks in advance for any suggestions or ideas.

Renato M.
SparkSQL Performance Tuning Options
Spark 1.2, no Hive; prefer not to use HiveContext to avoid metastore_db.

The use case is a Spark YARN app that will start and serve as a query server for multiple users, i.e. always up and running. At startup, there is an option to cache data and also pre-compute some result sets, hash maps, etc. that would likely be asked for by client APIs. I.e., there is some option to use startup time to precompute/cache, but the query response time requirement on a large data set is very stringent.

Hoping to use SparkSQL (but a combination of SQL and RDD APIs is also OK).

* Does SparkSQL execution use underlying partition information? (Data is from HDFS)
* Are there any ways to give hints to the SparkSQL execution about any precomputed/pre-cached RDDs?
* Packages spark.sql.execution, spark.sql.execution.joins and other sql.xxx packages: is using these for tuning the query plan recommended? Would like to keep this as-needed if possible.
* Features not in the current release but scheduled for an upcoming release would also be good to know.

Thanks,

PS: This is not a small topic, so if someone prefers to start an offline thread on details, I can do that and summarize the conclusions back to this thread.
Re: SparkSQL Performance Tuning Options
On 1/27/15 5:55 PM, Cheng Lian wrote:
On 1/27/15 11:38 AM, Manoj Samel wrote:

> Spark 1.2, no Hive; prefer not to use HiveContext to avoid metastore_db.
>
> The use case is a Spark YARN app that will start and serve as a query server for multiple users, i.e. always up and running. At startup, there is an option to cache data and also pre-compute some result sets, hash maps, etc. that would likely be asked for by client APIs. I.e., there is some option to use startup time to precompute/cache, but the query response time requirement on a large data set is very stringent.
>
> Hoping to use SparkSQL (but a combination of SQL and RDD APIs is also OK).
>
> * Does SparkSQL execution use underlying partition information? (Data is from HDFS)

No. For example, if the underlying data has already been partitioned by some key, Spark SQL doesn't know it and can't leverage that information to avoid a shuffle when doing aggregation on that key. However, partitioning the data ahead of time does help minimize shuffle network IO. There's a JIRA ticket to make Spark SQL aware of the underlying data distribution.

Maybe you are asking about locality? If that's the case, just want to add that Spark SQL does understand locality information of the underlying data. It's obtained from Hadoop InputFormat.

> * Are there any ways to give hints to the SparkSQL execution about any precomputed/pre-cached RDDs?

Instead of caching the raw RDD, it's recommended to transform the raw RDD to a SchemaRDD and then cache it, so that in-memory columnar storage can be used. Also, Spark SQL recognizes cached SchemaRDDs automatically.

> * Packages spark.sql.execution, spark.sql.execution.joins and other sql.xxx packages: is using these for tuning the query plan recommended? Would like to keep this as-needed if possible.

Not sure whether I understood this question. Are you trying to use internal APIs to do customized optimizations?

> * Features not in the current release but scheduled for an upcoming release would also be good to know.
>
> Thanks,
>
> PS: This is not a small topic, so if someone prefers to start an offline thread on details, I can do that and summarize the conclusions back to this thread.
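To illustrate why knowledge of the partitioning matters for aggregation, here is a minimal plain-Python sketch. It is not Spark code, and per Cheng Lian's reply Spark SQL in this release does not exploit this information. The helpers `partition_by_key` and `local_sum` and the integer-keyed records are hypothetical. When data is partitioned by the aggregation key, each partition can be summed locally and the results simply combined. With an arbitrary split, the same key shows up in several partitions, so partial results have to be regrouped by key, which is the role a shuffle plays.

```python
from collections import defaultdict


def partition_by_key(records, num_partitions):
    """Place each (key, value) record in the partition chosen by its key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[key % num_partitions].append((key, value))
    return parts


def local_sum(partition):
    """Sum values per key using only the records in one partition."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


records = [(k % 4, k) for k in range(12)]

# Pre-partitioned by key: no key spans two partitions, so local sums are final.
by_key = partition_by_key(records, 2)
local = [local_sum(p) for p in by_key]
merged = {}
for totals in local:
    merged.update(totals)  # plain union, no cross-partition regrouping

# Arbitrary split: every key appears in both halves, so the partial sums
# would still have to be regrouped by key (the shuffle step).
arbitrary = [records[:6], records[6:]]
partial = [local_sum(p) for p in arbitrary]
shared_keys = set(partial[0]) & set(partial[1])

print(merged)
print(sorted(shared_keys))
```

The keys are small integers only to keep the partition assignment deterministic; a real partitioner would hash arbitrary keys.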
Re: SparkSQL performance
I did some simple experiments with Impala and Spark, and Impala came out ahead. But it’s also less flexible, couldn’t handle irregular schemas, didn't support JSON, and so on.

On 01.11.2014, at 02:20, Soumya Simanta soumya.sima...@gmail.com wrote:
> I agree. My personal experience with Spark core is that it performs really well once you tune it properly. As far as I understand, SparkSQL under the hood performs many of these optimizations (order of Spark operations) and uses a more efficient storage format. Is this assumption correct?
>
> Has anyone done any comparison of SparkSQL with Impala? The fact that many of the queries don't even finish in the benchmark is quite surprising and hard to believe.
>
> A few months ago there were a few emails about Spark not being able to handle large volumes (TBs) of data. That myth was busted recently when the folks at Databricks published their sorting record results.
>
> Thanks
> -Soumya
>
> On Fri, Oct 31, 2014 at 7:35 PM, Du Li l...@yahoo-inc.com wrote:
>> We have seen all kinds of results published that often contradict each other. My take is that the authors often know more tricks about how to tune their own/familiar products than the others. So the product in focus is tuned for ideal performance while the competitors are not. The authors are not necessarily biased, but as a consequence the results are.
>>
>> Ideally it’s critical for the user community to be informed of all the in-depth tuning tricks of all products. However, realistically, there is a big gap in terms of documentation. Hope the Spark folks will make a difference. :-)
>>
>> Du
>>
>> From: Soumya Simanta soumya.sima...@gmail.com
>> Date: Friday, October 31, 2014 at 4:04 PM
>> To: user@spark.apache.org
>> Subject: SparkSQL performance
>>
>>> I was really surprised to see the results here, esp. SparkSQL not completing:
>>> http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
>>>
>>> I was under the impression that SparkSQL performs really well because it can optimize the RDD operations and load only the columns that are required. This essentially means in most cases SparkSQL should be as fast as Spark is.
>>>
>>> I would be very interested to hear what others in the group have to say about this.
>>>
>>> Thanks
>>> -Soumya
SparkSQL performance
I was really surprised to see the results here, esp. SparkSQL not completing:
http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style

I was under the impression that SparkSQL performs really well because it can optimize the RDD operations and load only the columns that are required. This essentially means in most cases SparkSQL should be as fast as Spark is.

I would be very interested to hear what others in the group have to say about this.

Thanks
-Soumya
Re: SparkSQL performance
We have seen all kinds of results published that often contradict each other. My take is that the authors often know more tricks about how to tune their own/familiar products than the others. So the product in focus is tuned for ideal performance while the competitors are not. The authors are not necessarily biased, but as a consequence the results are.

Ideally it’s critical for the user community to be informed of all the in-depth tuning tricks of all products. However, realistically, there is a big gap in terms of documentation. Hope the Spark folks will make a difference. :-)

Du

From: Soumya Simanta soumya.sima...@gmail.com
Date: Friday, October 31, 2014 at 4:04 PM
To: user@spark.apache.org
Subject: SparkSQL performance

> I was really surprised to see the results here, esp. SparkSQL not completing:
> http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
>
> I was under the impression that SparkSQL performs really well because it can optimize the RDD operations and load only the columns that are required. This essentially means in most cases SparkSQL should be as fast as Spark is.
>
> I would be very interested to hear what others in the group have to say about this.
>
> Thanks
> -Soumya
Re: SparkSQL performance
I agree. My personal experience with Spark core is that it performs really well once you tune it properly. As far as I understand, SparkSQL under the hood performs many of these optimizations (order of Spark operations) and uses a more efficient storage format. Is this assumption correct?

Has anyone done any comparison of SparkSQL with Impala? The fact that many of the queries don't even finish in the benchmark is quite surprising and hard to believe.

A few months ago there were a few emails about Spark not being able to handle large volumes (TBs) of data. That myth was busted recently when the folks at Databricks published their sorting record results.

Thanks
-Soumya

On Fri, Oct 31, 2014 at 7:35 PM, Du Li l...@yahoo-inc.com wrote:
> We have seen all kinds of results published that often contradict each other. My take is that the authors often know more tricks about how to tune their own/familiar products than the others. So the product in focus is tuned for ideal performance while the competitors are not. The authors are not necessarily biased, but as a consequence the results are.
>
> Ideally it’s critical for the user community to be informed of all the in-depth tuning tricks of all products. However, realistically, there is a big gap in terms of documentation. Hope the Spark folks will make a difference. :-)
>
> Du
>
> From: Soumya Simanta soumya.sima...@gmail.com
> Date: Friday, October 31, 2014 at 4:04 PM
> To: user@spark.apache.org
> Subject: SparkSQL performance
>
>> I was really surprised to see the results here, esp. SparkSQL not completing:
>> http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
>>
>> I was under the impression that SparkSQL performs really well because it can optimize the RDD operations and load only the columns that are required. This essentially means in most cases SparkSQL should be as fast as Spark is.
>>
>> I would be very interested to hear what others in the group have to say about this.
>>
>> Thanks
>> -Soumya