Mingda: Please also share the numbers for 100GB, which show AsterixDB not quite doing as well as Spark. These 100GB results will not be in our submission version, since they’re not needed for the desired message: picking the right join order matters. Nevertheless, I’d like to get a better understanding of what’s going on in the larger dataset regime.
-Tyson From: Yingyi Bu [mailto:[email protected]] Sent: Tuesday, December 20, 2016 4:30 PM To: [email protected] Cc: Michael Carey <[email protected]>; Tyson Condie <[email protected]> Subject: Re: Time of Multiple Joins in AsterixDB Hi Mingda, It looks that you didn't attach the pdf? Thanks! Best, Yingyi On Tue, Dec 20, 2016 at 4:15 PM, mingda li <[email protected] <mailto:[email protected]> > wrote: Sorry for the wrong version of cc.conf. I convert it to pdf version as attachment. On Tue, Dec 20, 2016 at 4:06 PM, mingda li <[email protected] <mailto:[email protected]> > wrote: Dear all, I am testing different systems' (AsterixDB, Spark, Hive, Pig) multiple joins to see if there is a big difference with different join order. This is the reason for our research on multiple join and the result will apppear in our paper which is to be submitted to VLDB soon. Could you help us to make sure that the test results make sense for AsterixDB? We configure the AsterixDB 0.8.9 ( use asterix-server-0.8.9-SNAPSHOT-binary-assembly) in our cluster of 16 machines, each with a 3.40GHz i7 processor (4 cores and 2 hyper-threads per core), 32GB of RAM and 1TB of disk capacity. The operating system is 64-bit Ubuntu 12.04. JDK version 1.8.0. During configuration, I follow the NCService instruction here https://ci.apache.org/projects/asterixdb/ncservice.html. And I set the cc.conf as in attachment. (Each node work as nc and the first node also work as cc). For experiment, we use 3 fact tables from TPC-DS: inventory; catalog_sales; catalog_returns with TPC-DS scale factor 1g and 10g. The multiple join query we use in AsterixDB are as following: Good Join Order: SELECT COUNT(*) FROM (SELECT * FROM catalog_sales cs1 JOIN catalog_returns cr1 ON (cs1.cs_order_number = cr1.cr_order_number AND cs1.cs_item_sk = cr1.cr_item_sk)) m1 JOIN inventory i1 ON i1.inv_item_sk = cs1.cs_item_sk; Bad Join Order: SELECT COUNT(*) FROM (SELECT * FROM catalog_sales cs1 JOIN inventory i1 ON cs1.cs_item_sk = i1.inv_item_sk) m1 JOIN catalog_returns cr1 ON (cs1.cs_order_number = cr1.cr_order_number AND cs1.cs_item_sk = cr1.cr_item_sk); We load the data to AsterixDB firstly and run the two different queries. (The complete version of all queries for AsterixDB is in attachment) We assume the data has already been stored in AsterixDB and only count the time for multiple join. Meanwhile, we use the same dataset and query to test Spark, Pig and Hive. The result is shown in the attachment's figure. And you can find AsterixDB's time is always better than others no matter good or bad order:-) (BTW, the y scale of figure is time in log scale. You can see the time by the label of each bar.) Thanks for your help. Bests, Mingda
