Hi Ravi,

Thanks for your comment.
I tested again with province excluded from the dictionary. In Spark the query time is around 3 seconds; in Presto the same query takes 9 seconds. So for this query case (short strings), dictionary lazy decode might not be the key factor.

Regards
Liang

2017-07-20 10:56 GMT+08:00 Ravindra Pesala <ravi.pes...@gmail.com>:

> Hi Liang,
>
> I see that the province column data is not big, so I guess lazy decoding
> hardly makes any impact in this scenario. Can you do one more test by
> excluding province from the dictionary in both the presto and spark
> integrations? It will tell whether it is really a lazy decoding issue or
> not.
>
> Regards,
> Ravindra
>
> On 20 July 2017 at 08:04, Liang Chen <chenliang6...@gmail.com> wrote:
>
> > Hi
> >
> > For -- 4) Lazy decoding of the dictionary, I just tested 180 million rows
> > of data with the script:
> > "select province,sum(age),count(*) from presto_carbondata group by province
> > order by province"
> >
> > The Spark integration module has "dictionary lazy decode", presto doesn't
> > have "dictionary lazy decode", and the performance differs by 4.5 times, so
> > "dictionary lazy decode" might help a lot to improve aggregation
> > performance.
> >
> > The detailed test results are below:
> >
> > *1. Presto+CarbonData is 9 seconds:*
> > presto:default> select province,sum(age),count(*) from presto_carbondata
> > group by province order by province;
> >  province |  _col1   |  _col2
> > ----------+----------+---------
> >  AB       | 57442740 | 1385010
> >  BC       | 57488826 | 1385580
> >  MB       | 57564702 | 1386510
> >  NB       | 57599520 | 1386960
> >  NL       | 57446592 | 1383774
> >  NS       | 57448734 | 1384272
> >  NT       | 57534228 | 1386936
> >  NU       | 57506844 | 1385346
> >  ON       | 57484956 | 1384470
> >  PE       | 57325164 | 1379802
> >  QC       | 57467886 | 1385076
> >  SK       | 57385152 | 1382364
> >  YT       | 57377556 | 1383900
> > (13 rows)
> >
> > Query 20170720_022833_00004_c9ky2, FINISHED, 1 node
> > Splits: 55 total, 55 done (100.00%)
> > 0:09 [18M rows, 34.3MB] [1.92M rows/s, 3.65MB/s]
> >
> > *2. Spark+CarbonData is 2 seconds:*
> > scala> benchmark { carbon.sql("select province,sum(age),count(*) from
> > presto_carbondata group by province order by province").show }
> > +--------+--------+--------+
> > |province|sum(age)|count(1)|
> > +--------+--------+--------+
> > |      AB|57442740| 1385010|
> > |      BC|57488826| 1385580|
> > |      MB|57564702| 1386510|
> > |      NB|57599520| 1386960|
> > |      NL|57446592| 1383774|
> > |      NS|57448734| 1384272|
> > |      NT|57534228| 1386936|
> > |      NU|57506844| 1385346|
> > |      ON|57484956| 1384470|
> > |      PE|57325164| 1379802|
> > |      QC|57467886| 1385076|
> > |      SK|57385152| 1382364|
> > |      YT|57377556| 1383900|
> > +--------+--------+--------+
> >
> > 2109.346231ms
> >
> > --
> > View this message in context:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Presto-CarbonData-optimization-work-discussion-tp18509p18522.html
> > Sent from the Apache CarbonData Dev Mailing List archive mailing list
> > archive at Nabble.com.
>
> --
> Thanks & Regards,
> Ravi
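
For anyone following the thread who is unsure what "dictionary lazy decode" refers to, here is a minimal sketch of the general idea in Python. It is not the CarbonData, Spark, or Presto implementation; the tiny dictionary, the sample rows, and the function names are all made up for illustration. The eager version decodes each row's dictionary code to a string before grouping, while the lazy version groups on the integer codes and decodes only the final group keys.

```python
from collections import defaultdict

# Hypothetical sample data: a dictionary-encoded string column and a measure.
dictionary = ["AB", "BC", "ON"]        # code -> decoded province value
province_codes = [0, 2, 1, 0, 2, 2]    # one dictionary code per row
ages = [30, 41, 25, 52, 38, 19]        # one age per row


def aggregate_eager(codes, ages, dictionary):
    """Decode every row to its string first, then group by the string."""
    acc = defaultdict(lambda: [0, 0])  # key -> [sum(age), count]
    for code, age in zip(codes, ages):
        key = dictionary[code]         # one decode per row
        acc[key][0] += age
        acc[key][1] += 1
    return sorted((k, s, c) for k, (s, c) in acc.items())


def aggregate_lazy(codes, ages, dictionary):
    """Group by the integer code; decode only the final group keys."""
    acc = defaultdict(lambda: [0, 0])  # code -> [sum(age), count]
    for code, age in zip(codes, ages):
        acc[code][0] += age
        acc[code][1] += 1
    # One decode per distinct group, not per row.
    return sorted((dictionary[code], s, c) for code, (s, c) in acc.items())


print(aggregate_lazy(province_codes, ages, dictionary))
```

Both functions return the same rows; the difference is only in how many decodes happen and whether the hash keys are strings or integers. With short two-letter values such as the province codes in this test, that saving can be small, which would be consistent with the retest result reported at the top of this message.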