Oh, I think you can find the figure in the first email. I can attach the three files again here. (BaselineDifferentDB.eps; cc_conf.pdf; CompleteQuery.pdf) Sorry for making any confusion.
On Tue, Dec 20, 2016 at 4:30 PM, Yingyi Bu <[email protected]> wrote: > Hi Mingda, > > It looks that you didn't attach the pdf? > Thanks! > > Best, > Yingyi > > On Tue, Dec 20, 2016 at 4:15 PM, mingda li <[email protected]> wrote: > > > Sorry for the wrong version of cc.conf. I convert it to pdf version as > > attachment. > > > > On Tue, Dec 20, 2016 at 4:06 PM, mingda li <[email protected]> > wrote: > > > >> Dear all, > >> > >> I am testing different systems' (AsterixDB, Spark, Hive, Pig) multiple > >> joins to see if there is a big difference with different join order. > This > >> is the reason for our research on multiple join and the result will > apppear > >> in our paper which is to be submitted to VLDB soon. Could you help us to > >> make sure that the test results make sense for AsterixDB? > >> > >> We configure the AsterixDB 0.8.9 ( use asterix-server-0.8.9-SNAPSHOT- > binary-assembly) > >> in our cluster of 16 machines, each with a 3.40GHz i7 processor (4 cores > >> and 2 hyper-threads per core), 32GB of RAM and 1TB of disk capacity. The > >> operating system is 64-bit Ubuntu 12.04. JDK version 1.8.0. During > >> configuration, I follow the NCService instruction here > >> https://ci.apache.org/projects/asterixdb/ncservice.html. And I set the > >> cc.conf as in attachment. (Each node work as nc and the first node also > >> work as cc). > >> > >> For experiment, we use 3 fact tables from TPC-DS: inventory; > >> catalog_sales; catalog_returns with TPC-DS scale factor 1g and 10g. The > >> multiple join query we use in AsterixDB are as following: > >> > >> Good Join Order: *SELECT COUNT(*) FROM (SELECT * FROM catalog_sales cs1 > >> JOIN catalog_returns cr1* > >> * ON (cs1.cs_order_number = cr1.cr_order_number AND cs1.cs_item_sk = > >> cr1.cr_item_sk)) m1 JOIN inventory i1 ON i1.inv_item_sk = > cs1.cs_item_sk;* > >> > >> Bad Join Order: *SELECT COUNT(*) FROM (SELECT * FROM catalog_sales cs1 > >> JOIN inventory i1 ON cs1.cs_item_sk = i1.inv_item_sk) m1 JOIN > >> catalog_returns cr1 ON (cs1.cs_order_number = cr1.cr_order_number AND > >> cs1.cs_item_sk = cr1.cr_item_sk);* > >> > >> We load the data to AsterixDB firstly and run the two different queries. > >> (The complete version of all queries for AsterixDB is in attachment) We > >> assume the data has already been stored in AsterixDB and only count the > >> time for multiple join. > >> > >> Meanwhile, we use the same dataset and query to test Spark, Pig and > Hive. > >> The result is shown in the attachment's figure. And you can find > >> AsterixDB's time is always better than others no matter good or bad > >> order:-) (BTW, the y scale of figure is time in log scale. You can see > the > >> time by the label of each bar.) > >> > >> Thanks for your help. > >> > >> Bests, > >> Mingda > >> > >> > >> > > >
