Sorry for the wrong version of cc.conf. I convert it to pdf version as attachment.
On Tue, Dec 20, 2016 at 4:06 PM, mingda li <[email protected]> wrote: > Dear all, > > I am testing different systems' (AsterixDB, Spark, Hive, Pig) multiple > joins to see if there is a big difference with different join order. This > is the reason for our research on multiple join and the result will apppear > in our paper which is to be submitted to VLDB soon. Could you help us to > make sure that the test results make sense for AsterixDB? > > We configure the AsterixDB 0.8.9 ( use > asterix-server-0.8.9-SNAPSHOT-binary-assembly) > in our cluster of 16 machines, each with a 3.40GHz i7 processor (4 cores > and 2 hyper-threads per core), 32GB of RAM and 1TB of disk capacity. The > operating system is 64-bit Ubuntu 12.04. JDK version 1.8.0. During > configuration, I follow the NCService instruction here > https://ci.apache.org/projects/asterixdb/ncservice.html. And I set the > cc.conf as in attachment. (Each node work as nc and the first node also > work as cc). > > For experiment, we use 3 fact tables from TPC-DS: inventory; > catalog_sales; catalog_returns with TPC-DS scale factor 1g and 10g. The > multiple join query we use in AsterixDB are as following: > > Good Join Order: *SELECT COUNT(*) FROM (SELECT * FROM catalog_sales cs1 > JOIN catalog_returns cr1* > * ON (cs1.cs_order_number = cr1.cr_order_number AND cs1.cs_item_sk = > cr1.cr_item_sk)) m1 JOIN inventory i1 ON i1.inv_item_sk = cs1.cs_item_sk;* > > Bad Join Order: *SELECT COUNT(*) FROM (SELECT * FROM catalog_sales cs1 > JOIN inventory i1 ON cs1.cs_item_sk = i1.inv_item_sk) m1 JOIN > catalog_returns cr1 ON (cs1.cs_order_number = cr1.cr_order_number AND > cs1.cs_item_sk = cr1.cr_item_sk);* > > We load the data to AsterixDB firstly and run the two different queries. > (The complete version of all queries for AsterixDB is in attachment) We > assume the data has already been stored in AsterixDB and only count the > time for multiple join. > > Meanwhile, we use the same dataset and query to test Spark, Pig and Hive. > The result is shown in the attachment's figure. And you can find > AsterixDB's time is always better than others no matter good or bad > order:-) (BTW, the y scale of figure is time in log scale. You can see the > time by the label of each bar.) > > Thanks for your help. > > Bests, > Mingda > > >
