Hi

Based on pull request 1307, the latest test result is below. Performance
improved about 3 times: the query now takes 3 seconds, down from the 9
seconds in the earlier test quoted below.

presto:default> select province,sum(age),count(*) from presto_carbon_dict
group by province order by province;
 province |  _col1   |  _col2
----------+----------+---------
 AB       | 57442740 | 1385010
 BC       | 57488826 | 1385580
 MB       | 57564702 | 1386510
 NB       | 57599520 | 1386960
 NL       | 57446592 | 1383774
 NS       | 57448734 | 1384272
 NT       | 57534228 | 1386936
 NU       | 57506844 | 1385346
 ON       | 57484956 | 1384470
 PE       | 57325164 | 1379802
 QC       | 57467886 | 1385076
 SK       | 57385152 | 1382364
 YT       | 57377556 | 1383900
(13 rows)

Query 20170902_033821_00006_h6g24, FINISHED, 1 node
Splits: 50 total, 50 done (100.00%)
0:03 [18M rows, 0B] [6.62M rows/s, 0B/s]


Regards
Liang


Liang Chen wrote
> Hi
> 
> Regarding 4) Lazy decoding of the dictionary: I just tested 180 million
> rows of data with this query:
> "select province,sum(age),count(*) from presto_carbondata group by
> province order by province"
> 
> The Spark integration module has "dictionary lazy decode" and Presto does
> not. The performance difference is about 4.5 times (9 seconds vs. 2
> seconds), so "dictionary lazy decode" might help considerably to improve
> aggregation performance.
> 
> The detailed test results are below:
>
> 1. Presto+CarbonData takes 9 seconds:
> presto:default> select province,sum(age),count(*) from presto_carbondata
> group by province order by province;
>  province |  _col1   |  _col2
> ----------+----------+---------
>  AB       | 57442740 | 1385010
>  BC       | 57488826 | 1385580
>  MB       | 57564702 | 1386510
>  NB       | 57599520 | 1386960
>  NL       | 57446592 | 1383774
>  NS       | 57448734 | 1384272
>  NT       | 57534228 | 1386936
>  NU       | 57506844 | 1385346
>  ON       | 57484956 | 1384470
>  PE       | 57325164 | 1379802
>  QC       | 57467886 | 1385076
>  SK       | 57385152 | 1382364
>  YT       | 57377556 | 1383900
> (13 rows)
> 
> Query 20170720_022833_00004_c9ky2, FINISHED, 1 node
> Splits: 55 total, 55 done (100.00%)
> 0:09 [18M rows, 34.3MB] [1.92M rows/s, 3.65MB/s]
>
> 2. Spark+CarbonData takes about 2 seconds:
> scala> benchmark { carbon.sql("select province,sum(age),count(*) from
> presto_carbondata group by province order by province").show }
> +--------+--------+--------+
> |province|sum(age)|count(1)|
> +--------+--------+--------+
> |      AB|57442740| 1385010|
> |      BC|57488826| 1385580|
> |      MB|57564702| 1386510|
> |      NB|57599520| 1386960|
> |      NL|57446592| 1383774|
> |      NS|57448734| 1384272|
> |      NT|57534228| 1386936|
> |      NU|57506844| 1385346|
> |      ON|57484956| 1384470|
> |      PE|57325164| 1379802|
> |      QC|57467886| 1385076|
> |      SK|57385152| 1382364|
> |      YT|57377556| 1383900|
> +--------+--------+--------+
> 
> 2109.346231ms
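
For readers unfamiliar with the "dictionary lazy decode" idea mentioned in the quoted message, here is a minimal sketch in Python. It is not the CarbonData, Spark, or Presto implementation; the function names and the tiny data set are hypothetical and only illustrate the general technique: group and aggregate on the integer dictionary codes, then decode only the final group keys instead of decoding every row up front.

```python
from collections import defaultdict


def build_dictionary(values):
    """Hypothetical encoder: map each distinct string to a small integer code."""
    dictionary = sorted(set(values))
    code_of = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [code_of[v] for v in values]


def aggregate_eager(dictionary, codes, ages):
    """Decode every row to its string first, then group by the string."""
    acc = defaultdict(lambda: [0, 0])
    for code, age in zip(codes, ages):
        province = dictionary[code]  # one decode per input row
        acc[province][0] += age
        acc[province][1] += 1
    return sorted((p, s, c) for p, (s, c) in acc.items())


def aggregate_lazy(dictionary, codes, ages):
    """Group by integer code; decode only once per output group."""
    sums = [0] * len(dictionary)
    counts = [0] * len(dictionary)
    for code, age in zip(codes, ages):
        sums[code] += age
        counts[code] += 1
    # Decoding happens here, once per distinct value rather than once per row.
    return sorted(
        (dictionary[c], sums[c], counts[c])
        for c in range(len(dictionary))
        if counts[c]
    )


# Toy stand-in for: select province, sum(age), count(*) ... group by province
provinces = ["AB", "BC", "AB", "ON", "BC", "AB"]
ages = [30, 40, 20, 50, 10, 5]
dictionary, codes = build_dictionary(provinces)
result = aggregate_lazy(dictionary, codes, ages)
```

Both functions return the same rows. In the lazy version the number of decodes depends on the number of distinct values (13 provinces in the test above) rather than on the number of rows scanned, which is why the technique tends to help group-by aggregation on dictionary-encoded columns.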





