Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-31 Thread Tin Vu
Hi Gaurav,

Thank you for your response. Here are the answers to your questions:
1. Spark 2.3.0
2. I was using the 'spark-sql' command, for example: 'spark-sql --master
spark:/*:7077 --database tpcds_bin_partitioned_orc_100 -f $file_name', where
file_name is the file that contains the SQL script ("select * from table_name").
3. Hadoop 2.9.0

I am using the JDBC connector to Drill from the Hive Metastore. SparkSQL is
also connecting to the ORC database through Hive.
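
In case it helps, here is a minimal sketch of what Gourav's suggestion
(querying Drill over JDBC from within Spark) might look like in spark-shell.
The Drill driver class, JDBC URL/port, and the Drill-side table path are
assumptions to verify against your Drill setup, and the Drill JDBC jar would
need to be on the Spark classpath (e.g. via --jars):

    // Hedged sketch: read a table through Drill's JDBC interface from Spark.
    // Driver class, URL format, port, and table path are assumptions.
    val drillDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:drill:drillbit=drill-host:31010")   // assumed host:port
      .option("driver", "org.apache.drill.jdbc.Driver")        // assumed driver class
      .option("dbtable", "hive.tpcds_bin_partitioned_orc_100.table_name")  // assumed path
      .load()

    drillDF.show(10)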

Thanks so much!

Tin

On Sat, Mar 31, 2018 at 11:41 AM, Gourav Sengupta <gourav.sengu...@gmail.com
> wrote:

> Hi Tin,
>
> This sounds interesting. While I would prefer to think that Presto and
> Drill have
>
> can you please provide the following details:
> 1. SPARK version
> 2. The exact code used in SPARK (the full code that was used)
> 3. HADOOP version
>
> I do think that SPARK and DRILL have complementary and different use
> cases. Have you tried using the JDBC connector to Drill from within SPARKSQL?
>
> Regards,
> Gourav Sengupta
>
>
> On Thu, Mar 29, 2018 at 1:03 AM, Tin Vu <tvu...@ucr.edu> wrote:
>
>> Hi,
>>
>> I am executing a benchmark to compare performance of SparkSQL, Apache
>> Drill and Presto. My experimental setup:
>>
>>- TPCDS dataset with scale factor 100 (size 100GB).
>>- Spark, Drill, and Presto have the same number of workers: 12.
>>- Each worker has the same allocated amount of memory: 4GB.
>>- Data is stored by Hive in ORC format.
>>
>> I executed a very simple SQL query: "SELECT * from table_name"
>> The issue is that for some small tables (even tables with a few dozen
>> records), SparkSQL still required about 7-8 seconds to finish, while
>> Drill and Presto needed less than 1 second.
>> For other large tables with billions of records, SparkSQL performance was
>> reasonable: it required 20-30 seconds to scan the whole table.
>> Do you have any idea or a reasonable explanation for this issue?
>>
>> Thanks,
>>
>>
>


Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-29 Thread Tin Vu
You are right. There are too many tasks being created. How can we reduce the
number of tasks?
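
For what it's worth, here is a minimal sketch (spark-shell) of the settings
that usually influence how many read tasks Spark creates for file-based
tables, plus a way to check the partition count. The values are illustrative,
and whether these settings apply to a Hive ORC table depends on whether Spark
converts the read to its native file source (see
spark.sql.hive.convertMetastoreOrc), so please verify against Spark 2.3.0:

    // Hedged sketch: knobs that commonly affect the number of scan tasks.
    // 256 MB per input split (default is 128 MB):
    spark.conf.set("spark.sql.files.maxPartitionBytes", "268435456")
    // Cost used when packing many small files into one task (default 4 MB):
    spark.conf.set("spark.sql.files.openCostInBytes", "8388608")

    // Check how many partitions (roughly, scan tasks) a table read produces:
    val df = spark.table("tpcds_bin_partitioned_orc_100.table_name")  // placeholder name
    println(df.rdd.getNumPartitions)

    // For very small tables, partitions can also be collapsed explicitly:
    df.coalesce(1).collect()

Comparing the task count per stage in the Spark UI (as Jayesh suggests below)
before and after changing these settings should show whether they take effect.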

On Thu, Mar 29, 2018, 7:44 AM Lalwani, Jayesh <jayesh.lalw...@capitalone.com>
wrote:

> Without knowing too many details, I can only guess. It could be that Spark
> is creating a lot of tasks even though there are few records. Creation and
> distribution of tasks have a noticeable overhead on smaller datasets.
>
>
>
> You might want to look at the driver logs, or the Spark Application Detail
> UI.
>
>
>
> *From: *Tin Vu <tvu...@ucr.edu>
> *Date: *Wednesday, March 28, 2018 at 8:04 PM
> *To: *"user@spark.apache.org" <user@spark.apache.org>
> *Subject: *[SparkSQL] SparkSQL performance on small TPCDS tables is very
> low when compared to Drill or Presto
>
>
>
> Hi,
>
>
>
> I am executing a benchmark to compare performance of SparkSQL, Apache
> Drill and Presto. My experimental setup:
>
> · TPCDS dataset with scale factor 100 (size 100GB).
>
> · Spark, Drill, and Presto have the same number of workers: 12.
>
> · Each worker has the same allocated amount of memory: 4GB.
>
> · Data is stored by Hive in ORC format.
>
> I executed a very simple SQL query: "SELECT * from table_name"
> The issue is that for some small tables (even tables with a few dozen
> records), SparkSQL still required about 7-8 seconds to finish, while Drill
> and Presto needed less than 1 second.
> For other large tables with billions of records, SparkSQL performance was
> reasonable: it required 20-30 seconds to scan the whole table.
> Do you have any idea or a reasonable explanation for this issue?
>
> Thanks,
>
>
>
>


Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Tin Vu
Thanks for your response. What do you mean by "immediately return"?

On Wed, Mar 28, 2018, 10:33 PM Jörn Franke <jornfra...@gmail.com> wrote:

> I don’t think select * is a good benchmark. You should do a more complex
> operation, otherwise optimizers might see that you don’t do anything in the
> query and immediately return (similarly, count might immediately return by
> using some statistics).
>
> On 29. Mar 2018, at 02:03, Tin Vu <tvu...@ucr.edu> wrote:
>
> Hi,
>
> I am executing a benchmark to compare performance of SparkSQL, Apache
> Drill and Presto. My experimental setup:
>
>- TPCDS dataset with scale factor 100 (size 100GB).
>- Spark, Drill, and Presto have the same number of workers: 12.
>- Each worker has the same allocated amount of memory: 4GB.
>- Data is stored by Hive in ORC format.
>
> I executed a very simple SQL query: "SELECT * from table_name"
> The issue is that for some small tables (even tables with a few dozen
> records), SparkSQL still required about 7-8 seconds to finish, while Drill
> and Presto needed less than 1 second.
> For other large tables with billions of records, SparkSQL performance was
> reasonable: it required 20-30 seconds to scan the whole table.
> Do you have any idea or a reasonable explanation for this issue?
>
> Thanks,
>
>


[SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Tin Vu
Hi,

I am executing a benchmark to compare performance of SparkSQL, Apache Drill
and Presto. My experimental setup:

   - TPCDS dataset with scale factor 100 (size 100GB).
   - Spark, Drill, and Presto have the same number of workers: 12.
   - Each worker has the same allocated amount of memory: 4GB.
   - Data is stored by Hive in ORC format.

I executed a very simple SQL query: "SELECT * from table_name"
The issue is that for some small tables (even tables with a few dozen
records), SparkSQL still required about 7-8 seconds to finish, while Drill
and Presto needed less than 1 second.
For other large tables with billions of records, SparkSQL performance was
reasonable: it required 20-30 seconds to scan the whole table.
Do you have any idea or a reasonable explanation for this issue?

Thanks,