TopN is a common analysis requirement. Kylin does support 'order by' and 'limit' currently, but it is online calculation. In order to greatly speed up TopN query, we can pre-calculate TopN and store it like a metrics.
I'm selecting algorithm at the moment. Don't require exact TopN result. Approximate algorithms with expected error are a lot more space saving and faster. My options are the below. 1. SpaceSaving [1] and its parallel version [2] 2. CountSketch [3] Seems SpaceSaving has both smaller footprint and is faster to calculate according to [1]. Does any one has experience could confirm? Or any suggestion on alternative algorithms? Cheers Yang [1] https://icmi.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf [2] http://arxiv.org/pdf/1401.0702.pdf [3] http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarCF.pdf
