RE: Time of Multiple Joins in AsterixDB

Tyson Condie Tue, 20 Dec 2016 17:19:14 -0800

Mingda: Please also share the numbers for 100GB, which show AsterixDB not quite 
doing as well as Spark. These 100GB results will not be in our submission 
version, since they’re not needed for the desired message: picking the right 
join order matters. Nevertheless, I’d like to get a better understanding of 
what’s going on in the larger dataset regime.

-Tyson

From: Yingyi Bu [mailto:[email protected]] 
Sent: Tuesday, December 20, 2016 4:30 PM
To: [email protected]
Cc: Michael Carey <[email protected]>; Tyson Condie <[email protected]>
Subject: Re: Time of Multiple Joins in AsterixDB

Hi Mingda,

     It looks that you didn't attach the pdf?

     Thanks!

Best,

Yingyi

On Tue, Dec 20, 2016 at 4:15 PM, mingda li <[email protected] 
<mailto:[email protected]> > wrote:

Sorry for the wrong version of cc.conf. I convert it to pdf version as 
attachment.

On Tue, Dec 20, 2016 at 4:06 PM, mingda li <[email protected] 
<mailto:[email protected]> > wrote:

Dear all,

I am testing different systems' (AsterixDB, Spark, Hive, Pig) multiple joins to 
see if there is a big difference with different join order. This is the reason 
for our research on multiple join and the result will apppear in our paper 
which is to be submitted to VLDB soon. Could you help us to make sure that the 
test results make sense for AsterixDB?

We configure the AsterixDB 0.8.9 ( use 
asterix-server-0.8.9-SNAPSHOT-binary-assembly) in our cluster of 16 machines, 
each with a 3.40GHz i7 processor (4 cores and 2 hyper-threads per core), 32GB 
of RAM and 1TB of disk capacity. The operating system is 64-bit Ubuntu 12.04. 
JDK version 1.8.0. During configuration, I follow the NCService instruction 
here https://ci.apache.org/projects/asterixdb/ncservice.html. And I set the 
cc.conf as in attachment. (Each node work as nc and the first node also work as 
cc). 

For experiment, we use 3 fact tables from TPC-DS: inventory; catalog_sales; 
catalog_returns with TPC-DS scale factor 1g and 10g. The multiple join query we 
use in AsterixDB are as following:

Good Join Order: SELECT COUNT(*) FROM (SELECT * FROM catalog_sales cs1 JOIN 
catalog_returns cr1

 ON (cs1.cs_order_number = cr1.cr_order_number AND cs1.cs_item_sk = 
cr1.cr_item_sk))  m1 JOIN inventory i1 ON i1.inv_item_sk = cs1.cs_item_sk;

Bad Join Order: SELECT COUNT(*) FROM (SELECT * FROM catalog_sales cs1 JOIN 
inventory i1 ON cs1.cs_item_sk = i1.inv_item_sk) m1 JOIN catalog_returns cr1 ON 
(cs1.cs_order_number = cr1.cr_order_number AND cs1.cs_item_sk = cr1.cr_item_sk);

We load the data to AsterixDB firstly and run the two different queries. (The 
complete version of all queries for AsterixDB is in attachment)  We assume the 
data has already been stored in AsterixDB and only count the time for multiple 
join. 

Meanwhile, we use the same dataset and query to test Spark, Pig and Hive. The 
result is shown in the attachment's figure. And you can find AsterixDB's time 
is always better than others  no matter good or bad order:-) (BTW, the y scale 
of figure is time in log scale. You can see the time by the label of each bar.)

Thanks for your help.

Bests,

Mingda

RE: Time of Multiple Joins in AsterixDB

Reply via email to