We will benchmark the SpaceSaving algorithm first and update our findings. On Mon, Aug 17, 2015 at 5:43 PM, Li Yang <[email protected]> wrote:
> TopN is a common analysis requirement. Kylin does support 'order by' and > 'limit' currently, but it is online calculation. In order to greatly speed > up TopN query, we can pre-calculate TopN and store it like a metrics. > > I'm selecting algorithm at the moment. Don't require exact TopN result. > Approximate algorithms with expected error are a lot more space saving and > faster. My options are the below. > > 1. SpaceSaving [1] and its parallel version [2] > 2. CountSketch [3] > > Seems SpaceSaving has both smaller footprint and is faster to calculate > according to [1]. Does any one has experience could confirm? Or any > suggestion on alternative algorithms? > > > Cheers > Yang > > [1] https://icmi.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf > [2] http://arxiv.org/pdf/1401.0702.pdf > [3] > http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarCF.pdf > >
