Re: Presto+CarbonData optimization work discussion

Liang Chen Wed, 19 Jul 2017 20:19:13 -0700

Hi Ravi

Thanks for your comment.


I tested again with province excluded from the dictionary. In Spark the query
takes around 3 seconds; in Presto the same query takes 9 seconds. So for this
query case (short strings), dictionary lazy decode might not be the key factor.
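
For readers following the thread, here is a minimal sketch of what "dictionary lazy decode" means for this kind of group-by. It is a toy model under stated assumptions, not CarbonData's or Presto's actual code: the object `LazyDecodeSketch`, the three-entry `dictionary`, and the sample `rows` are all hypothetical. Eager decoding turns every surrogate key back into a string before grouping; lazy decoding groups on the integer keys and decodes only once per result group.

```scala
// Hypothetical, simplified model of eager vs. lazy dictionary decoding.
// None of these names come from CarbonData or Presto.
object LazyDecodeSketch {
  // Dictionary: surrogate key (array index) -> original string value.
  val dictionary: Array[String] = Array("AB", "BC", "ON")

  // Stored rows: (province surrogate key, age).
  val rows: Seq[(Int, Long)] =
    Seq((0, 30L), (2, 41L), (0, 25L), (1, 52L), (2, 19L))

  // Eager decode: one dictionary lookup per row, then group on strings.
  def eager(rows: Seq[(Int, Long)]): Seq[(String, Long, Long)] =
    rows
      .map { case (key, age) => (dictionary(key), age) }
      .groupBy(_._1)
      .map { case (province, group) =>
        (province, group.map(_._2).sum, group.size.toLong)
      }
      .toSeq
      .sortBy(_._1)

  // Lazy decode: group on integer keys, one dictionary lookup per group.
  def lazyDecode(rows: Seq[(Int, Long)]): Seq[(String, Long, Long)] =
    rows
      .groupBy(_._1)
      .map { case (key, group) =>
        (dictionary(key), group.map(_._2).sum, group.size.toLong)
      }
      .toSeq
      .sortBy(_._1)

  def main(args: Array[String]): Unit = {
    // Both strategies return (province, sum(age), count) sorted by province.
    println(eager(rows))
    println(lazyDecode(rows))
  }
}
```

Both functions return the same rows; only the number of dictionary lookups differs (one per input row versus one per group). When the decoded values are short strings, each lookup is cheap, which would fit with lazy decode not being the main factor in this query.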

Regards
Liang

2017-07-20 10:56 GMT+08:00 Ravindra Pesala <ravi.pes...@gmail.com>:

> Hi Liang,
>
> I see that the province column data is not big, so I guess lazy decoding
> hardly makes any impact in this scenario. Can you do one more test,
> excluding province from the dictionary in both the Presto and Spark
> integrations? It will tell us whether it is really a lazy decoding issue or
> not.
>
> Regards,
> Ravindra
>
> On 20 July 2017 at 08:04, Liang Chen <chenliang6...@gmail.com> wrote:
>
> > Hi
> >
> > For -- 4) Lazy decoding of the dictionary, I just tested 18 million rows
> > of data with the script:
> > "select province,sum(age),count(*) from presto_carbondata group by
> > province order by province"
> >
> > The Spark integration module has "dictionary lazy decode"; Presto doesn't
> > have it. The performance differs by about 4.5 times, so
> > "dictionary lazy decode" might help a lot to improve aggregation
> > performance.
> >
> > The detailed test results are below:
> >
> > *1. Presto+CarbonData: 9 seconds*
> > presto:default> select province,sum(age),count(*) from presto_carbondata
> > group by province order by province;
> >  province |  _col1   |  _col2
> > ----------+----------+---------
> >  AB       | 57442740 | 1385010
> >  BC       | 57488826 | 1385580
> >  MB       | 57564702 | 1386510
> >  NB       | 57599520 | 1386960
> >  NL       | 57446592 | 1383774
> >  NS       | 57448734 | 1384272
> >  NT       | 57534228 | 1386936
> >  NU       | 57506844 | 1385346
> >  ON       | 57484956 | 1384470
> >  PE       | 57325164 | 1379802
> >  QC       | 57467886 | 1385076
> >  SK       | 57385152 | 1382364
> >  YT       | 57377556 | 1383900
> > (13 rows)
> >
> > Query 20170720_022833_00004_c9ky2, FINISHED, 1 node
> > Splits: 55 total, 55 done (100.00%)
> > 0:09 [18M rows, 34.3MB] [1.92M rows/s, 3.65MB/s]
> >
> > *2. Spark+CarbonData: 2 seconds*
> > scala> benchmark { carbon.sql("select province,sum(age),count(*) from
> > presto_carbondata group by province order by province").show }
> > +--------+--------+--------+
> > |province|sum(age)|count(1)|
> > +--------+--------+--------+
> > |      AB|57442740| 1385010|
> > |      BC|57488826| 1385580|
> > |      MB|57564702| 1386510|
> > |      NB|57599520| 1386960|
> > |      NL|57446592| 1383774|
> > |      NS|57448734| 1384272|
> > |      NT|57534228| 1386936|
> > |      NU|57506844| 1385346|
> > |      ON|57484956| 1384470|
> > |      PE|57325164| 1379802|
> > |      QC|57467886| 1385076|
> > |      SK|57385152| 1382364|
> > |      YT|57377556| 1383900|
> > +--------+--------+--------+
> >
> > 2109.346231ms
> >
> >
> >
> >
>
>
>
> --
> Thanks & Regards,
> Ravi
>
