BTW: In the 1-10G figure's labels, 'W' means wrong order and 'R' means right order. We always put the wrong-order join to the left of the right-order join. For the 1-100G figure we didn't add labels, but the wrong-order join is also on the left side.

On Tue, Dec 20, 2016 at 5:44 PM, mingda li <[email protected]> wrote:

> Oh, sure. When we test the 100G multiple join, we find AsterixDB is slower
> than Spark (but still faster than Pig and Hive).
> I can share both plots with you: 1-10G.eps and 1-100G.eps. (We will
> only use 1-10G.eps in our paper.)
> And thanks for Ian's advice: *The dev list generally strips attachments.
> Maybe you can just put the config inline? Or link to a pastebin/gist?*
> Now I know why you couldn't see the attachments, so I moved the plots and
> the two documents to my Dropbox.
> You can find them here:
> 1-10G.eps: https://www.dropbox.com/s/rk3xg6gigsfcuyq/1-10G.eps?dl=0
> 1-100G.eps: https://www.dropbox.com/s/tyxnmt6ehau2ski/1-100G.eps?dl=0
> cc_conf.pdf: https://www.dropbox.com/s/y3of1s17qdstv5f/cc_conf.pdf?dl=0
> CompleteQuery.pdf: https://www.dropbox.com/s/lml3fzxfjcmf2c1/CompleteQuery.pdf?dl=0
>
> On Tue, Dec 20, 2016 at 4:40 PM, Tyson Condie <[email protected]>
> wrote:
>
>> Mingda: Please also share the numbers for 100GB, which show AsterixDB not
>> quite doing as well as Spark. These 100GB results will not be in our
>> submission version, since they’re not needed for the desired message:
>> picking the right join order matters. Nevertheless, I’d like to get a
>> better understanding of what’s going on in the larger dataset regime.
>>
>> -Tyson
>>
>> From: Yingyi Bu [mailto:[email protected]]
>> Sent: Tuesday, December 20, 2016 4:30 PM
>> To: [email protected]
>> Cc: Michael Carey <[email protected]>; Tyson Condie <[email protected]>
>> Subject: Re: Time of Multiple Joins in AsterixDB
>>
>> Hi Mingda,
>>
>> It looks like you didn't attach the PDF?
>>
>> Thanks!
>>
>> Best,
>> Yingyi
>>
>> On Tue, Dec 20, 2016 at 4:15 PM, mingda li <[email protected]> wrote:
>>
>> Sorry for the wrong version of cc.conf. I converted it to PDF and
>> attached it.
>>
>> On Tue, Dec 20, 2016 at 4:06 PM, mingda li <[email protected]> wrote:
>>
>> Dear all,
>>
>> I am testing multiple joins on different systems (AsterixDB, Spark, Hive,
>> Pig) to see whether the join order makes a big difference. This is the
>> motivation for our research on multiple joins, and the results will appear
>> in our paper, which is to be submitted to VLDB soon. Could you help us
>> make sure that the test results make sense for AsterixDB?
>>
>> We configured AsterixDB 0.8.9 (using
>> asterix-server-0.8.9-SNAPSHOT-binary-assembly) on our cluster of 16
>> machines, each with a 3.40GHz i7 processor (4 cores and 2 hyper-threads
>> per core), 32GB of RAM and 1TB of disk capacity. The operating system is
>> 64-bit Ubuntu 12.04, and the JDK version is 1.8.0. For configuration, I
>> followed the NCService instructions here:
>> https://ci.apache.org/projects/asterixdb/ncservice.html
>> I set cc.conf as in the attachment. (Each node works as an NC, and the
>> first node also works as the CC.)
>>
>> For the experiment, we use 3 fact tables from TPC-DS: inventory,
>> catalog_sales and catalog_returns, with TPC-DS scale factors 1g and 10g.
>> The multiple join queries we use in AsterixDB are as follows:
>>
>> Good Join Order:
>> SELECT COUNT(*) FROM (SELECT * FROM catalog_sales cs1
>> JOIN catalog_returns cr1
>> ON (cs1.cs_order_number = cr1.cr_order_number AND
>> cs1.cs_item_sk = cr1.cr_item_sk)) m1
>> JOIN inventory i1 ON i1.inv_item_sk = cs1.cs_item_sk;
>>
>> Bad Join Order:
>> SELECT COUNT(*) FROM (SELECT * FROM catalog_sales cs1
>> JOIN inventory i1 ON cs1.cs_item_sk = i1.inv_item_sk) m1
>> JOIN catalog_returns cr1
>> ON (cs1.cs_order_number = cr1.cr_order_number AND
>> cs1.cs_item_sk = cr1.cr_item_sk);
>>
>> We load the data into AsterixDB first and then run the two different
>> queries.
>> (The complete version of all queries for AsterixDB is in the attachment.)
>> We assume the data has already been stored in AsterixDB and only count
>> the time for the multiple join.
>>
>> Meanwhile, we use the same dataset and queries to test Spark, Pig and
>> Hive. The results are shown in the attached figure. You can see that
>> AsterixDB's time is always better than the others', for both the good and
>> the bad order :-) (BTW, the y-axis of the figure is time on a log scale.
>> You can read the time from the label on each bar.)
>>
>> Thanks for your help.
>>
>> Bests,
>> Mingda
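
To illustrate why the two join orders quoted above can differ so much in cost, here is a minimal Python sketch on toy data. Everything in it is an assumption for illustration only: the tiny synthetic tables, the `hash_join` helper and the resulting row counts are hypothetical, and they say nothing about the TPC-DS measurements or about how AsterixDB actually executes these queries. The only point is that both orders return the same final result while materializing very different intermediate results.

```python
from collections import defaultdict


def hash_join(left, right, left_key, right_key):
    """Inner equi-join of two lists of dicts using an in-memory hash index."""
    index = defaultdict(list)
    for r in right:
        index[right_key(r)].append(r)
    # Merge every left row with each right row that shares its join key.
    return [{**l, **r} for l in left for r in index.get(left_key(l), [])]


# Hypothetical toy tables (NOT TPC-DS data): 100 sales, 10 returns,
# and 5 items with 20 inventory rows each.
catalog_sales = [{"cs_order_number": o, "cs_item_sk": o % 5} for o in range(100)]
catalog_returns = [{"cr_order_number": o, "cr_item_sk": o % 5} for o in range(0, 100, 10)]
inventory = [{"inv_item_sk": i, "inv_date_sk": d} for i in range(5) for d in range(20)]


def sales_key(row):
    return (row["cs_order_number"], row["cs_item_sk"])


def returns_key(row):
    return (row["cr_order_number"], row["cr_item_sk"])


# "Good" order: the selective sales/returns join runs first.
good_mid = hash_join(catalog_sales, catalog_returns, sales_key, returns_key)
good_final = hash_join(good_mid, inventory,
                       lambda r: r["cs_item_sk"], lambda r: r["inv_item_sk"])

# "Bad" order: the sales/inventory join runs first and fans out.
bad_mid = hash_join(catalog_sales, inventory,
                    lambda r: r["cs_item_sk"], lambda r: r["inv_item_sk"])
bad_final = hash_join(bad_mid, catalog_returns, sales_key, returns_key)

print("good order: intermediate rows =", len(good_mid), "final rows =", len(good_final))
print("bad order:  intermediate rows =", len(bad_mid), "final rows =", len(bad_final))
```

On this toy data the good order builds 10 intermediate rows and the bad order builds 2000, while both end with the same 200 result rows. A real engine can of course reorder, pipeline or partition joins, so this is a rough intuition and not a model of any of the systems compared in the thread.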
