TopN is a common analysis requirement. Kylin does support 'order by' and
'limit' currently, but it is online calculation. In order to greatly speed
up TopN query, we can pre-calculate TopN and store it like a metrics.

I'm selecting algorithm at the moment. Don't require exact TopN result.
Approximate algorithms with expected error are a lot more space saving and
faster. My options are the below.

1. SpaceSaving [1] and its parallel version [2]
2. CountSketch [3]

Seems SpaceSaving has both smaller footprint and is faster to calculate
according to [1]. Does any one has experience could confirm? Or any
suggestion on alternative algorithms?


Cheers
Yang

[1] https://icmi.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf
[2] http://arxiv.org/pdf/1401.0702.pdf
[3]
http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarCF.pdf

Reply via email to