Re: Time of Multiple Joins in AsterixDB

Ian Maxon Tue, 20 Dec 2016 17:02:13 -0800

The dev list generally strips attachments. Maybe you can just put the
config inline? Or link to a pastebin/gist?


On Tue, Dec 20, 2016 at 4:45 PM, mingda li <[email protected]> wrote:

> Oh, I think you can find the figure in the first email.
> I can attach the three files again here (BaselineDifferentDB.eps,
> cc_conf.pdf and CompleteQuery.pdf).
> Sorry for any confusion.
>
>
> On Tue, Dec 20, 2016 at 4:30 PM, Yingyi Bu <[email protected]> wrote:
>
>> Hi Mingda,
>>
>>      It looks like you didn't attach the PDF?
>>      Thanks!
>>
>> Best,
>> Yingyi
>>
>> On Tue, Dec 20, 2016 at 4:15 PM, mingda li <[email protected]>
>> wrote:
>>
>> > Sorry for the wrong version of cc.conf. I have converted it to PDF and
>> > attached it.
>> >
>> > On Tue, Dec 20, 2016 at 4:06 PM, mingda li <[email protected]> wrote:
>> >
>> >> Dear all,
>> >>
>> >> I am testing multi-way joins on different systems (AsterixDB, Spark,
>> >> Hive, Pig) to see whether the join order makes a big difference. This
>> >> is the motivation for our research on multi-way joins, and the results
>> >> will appear in a paper that we plan to submit to VLDB soon. Could you
>> >> help us make sure that the test results make sense for AsterixDB?
>> >>
>> >> We configured AsterixDB 0.8.9 (using
>> >> asterix-server-0.8.9-SNAPSHOT-binary-assembly) on our cluster of 16
>> >> machines, each with a 3.40GHz i7 processor (4 cores and 2 hyper-threads
>> >> per core), 32GB of RAM and 1TB of disk capacity. The operating system
>> >> is 64-bit Ubuntu 12.04, and the JDK version is 1.8.0. For
>> >> configuration, I followed the NCService instructions at
>> >> https://ci.apache.org/projects/asterixdb/ncservice.html and set
>> >> cc.conf as in the attachment. (Each node works as an NC, and the first
>> >> node also works as the CC.)
>> >>
>> >> For the experiment, we use three fact tables from TPC-DS (inventory,
>> >> catalog_sales and catalog_returns) at TPC-DS scale factors 1g and 10g.
>> >> The multi-way join queries we use in AsterixDB are as follows:
>> >>
>> >> Good Join Order:
>> >> SELECT COUNT(*) FROM (SELECT * FROM catalog_sales cs1
>> >> JOIN catalog_returns cr1
>> >> ON (cs1.cs_order_number = cr1.cr_order_number AND
>> >> cs1.cs_item_sk = cr1.cr_item_sk)) m1
>> >> JOIN inventory i1 ON i1.inv_item_sk = cs1.cs_item_sk;
>> >>
>> >> Bad Join Order:
>> >> SELECT COUNT(*) FROM (SELECT * FROM catalog_sales cs1
>> >> JOIN inventory i1 ON cs1.cs_item_sk = i1.inv_item_sk) m1
>> >> JOIN catalog_returns cr1
>> >> ON (cs1.cs_order_number = cr1.cr_order_number AND
>> >> cs1.cs_item_sk = cr1.cr_item_sk);
>> >>
>> >> We first load the data into AsterixDB and then run the two queries.
>> >> (The complete version of all queries for AsterixDB is in the
>> >> attachment.) We assume the data is already stored in AsterixDB and
>> >> only measure the time of the multi-way join itself.
>> >>
>> >> We also use the same datasets and queries to test Spark, Pig and Hive.
>> >> The results are shown in the attached figure. As you can see,
>> >> AsterixDB's time is always better than the others', for both the good
>> >> and the bad join order :-) (BTW, the y-axis of the figure is time on a
>> >> log scale. The label on each bar gives the actual time.)
>> >>
>> >> Thanks for your help.
>> >>
>> >> Bests,
>> >> Mingda
>> >>
>> >>
>> >>
>> >
>>
>
>
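
For readers following the quoted queries, here is a minimal sketch of why the two join orders can differ in cost even though they return the same count. It is an illustration only: the rows are made-up toy data (not TPC-DS), `hash_join`, `good_order` and `bad_order` are hypothetical helper names, and the simple in-memory hash join is not a description of how AsterixDB executes these queries.

```python
from collections import defaultdict


def hash_join(left, right, key_left, key_right):
    """Inner equi-join of two lists of dicts using an in-memory hash index."""
    index = defaultdict(list)
    for r in right:
        index[key_right(r)].append(r)
    return [{**l, **r} for l in left for r in index.get(key_left(l), [])]


# Hypothetical toy rows, chosen only to make the size difference visible.
catalog_sales = [
    {"cs_order_number": o, "cs_item_sk": i}
    for o, i in [(1, 10), (2, 10), (3, 20), (4, 30)]
]
catalog_returns = [{"cr_order_number": 1, "cr_item_sk": 10}]
inventory = [
    {"inv_item_sk": i, "inv_date_sk": d} for i in (10, 20, 30) for d in range(5)
]


def sales_key(r):
    return (r["cs_order_number"], r["cs_item_sk"])


def returns_key(r):
    return (r["cr_order_number"], r["cr_item_sk"])


def good_order():
    # Join sales with returns first: the selective join keeps m1 small.
    m1 = hash_join(catalog_sales, catalog_returns, sales_key, returns_key)
    out = hash_join(m1, inventory,
                    lambda r: r["cs_item_sk"], lambda r: r["inv_item_sk"])
    return len(m1), len(out)


def bad_order():
    # Join sales with inventory first: every sale fans out to all
    # inventory rows for its item, so m1 is large.
    m1 = hash_join(catalog_sales, inventory,
                   lambda r: r["cs_item_sk"], lambda r: r["inv_item_sk"])
    out = hash_join(m1, catalog_returns, sales_key, returns_key)
    return len(m1), len(out)


if __name__ == "__main__":
    print("good order (intermediate rows, final rows):", good_order())
    print("bad order  (intermediate rows, final rows):", bad_order())
```

On this toy data both orders produce the same final row count, but the intermediate result `m1` is 1 row in the good order and 20 rows in the bad order. That intermediate size is the quantity the good/bad join-order comparison is meant to stress.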
