Hi Ted,

Thank you so much for the quick reply; it is very informative. Kindly find my
replies below.

Regarding deviation, yes. We can compute the deviation as the difference
between the new score and the past score (the mean over the past few days of
data).
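
For example, here is a rough sketch of what I mean (the daily scores and the
length of the trailing window are just assumptions for illustration):

    def deviation(new_score, past_scores):
        # past_scores: the item's daily scores for the trailing window
        # (say, the last 7 days); the baseline is simply their mean
        baseline = sum(past_scores) / len(past_scores)
        return new_score - baseline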

Regarding variations having less impact on week boundaries, I am sorry, but I
couldn't follow how the impact would be less. Am I right in assuming that week
boundaries means the mean of the past 7 days of data?

To make sure that I understood correctly, I am reiterating what you said. We
compute the occurrence counts for all items and rank the top 10k or 100k (some
N items); the rest are given rank N+1. Then we compute the score for each item
as log(r_new/r_old) (or is it log(r_old/r_new), as you specified below?),
where r_old is the rank from the previous time frame (week, day, hour, etc.).
Finally, we sort by score to get the items that are varying heavily. Kindly
correct me if I am wrong.
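
Here is a rough Python sketch of my understanding, just for discussion (I have
used the log(r_old/r_new) form from your mail below; the cutoff N and the
function names are my own assumptions):

    from collections import Counter
    from math import log

    N = 10000  # keep ranks only for the top N items

    def ranks(counts):
        # counts: Counter mapping item -> occurrences for one period
        # returns item -> rank (1 = most frequent) for the top N items only
        return {item: r
                for r, (item, _) in enumerate(counts.most_common(N), start=1)}

    def trend_scores(old_counts, new_counts):
        old_rank = ranks(old_counts)
        new_rank = ranks(new_counts)
        scores = {}
        for item, r_new in new_rank.items():
            # brand-new items get a rank just beyond anything we retain
            r_old = old_rank.get(item, N + 1)
            # large positive score = item rising sharply in rank
            scores[item] = log(r_old / r_new)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)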

Also, when you say k, do you mean the number of occurrences?

Thanks
Pallavi

-----Original Message-----
From: Ted Dunning [mailto:[email protected]] 
Sent: Thursday, September 24, 2009 10:55 AM
To: [email protected]
Subject: Re: Trending and prediction Analysis

When you say deviation, do you mean deviation in rate of something that is
nominally Poisson distributed (after accounting for diurnal and weekly
variations)?

If you can do the computation on week boundaries, then these variations have
less effect.

Alternatively, if you base the computation on rank, it is also much easier.

For the rank-based computation, if you assume Zipfian distribution, then the
cumulative distribution is close to log r (r = rank).  This is typically
accurate for common web applications except for the top 1% or so of items.
This implies that the quantity

         score = log (r_old / r_new)

is a good measure of the significance of the change for an item.  This is
close to what we used at Veoh with very good results.  We tried a number of
more complicated approaches and this was one of the best.  Certainly, it was
the best simple measure. You could equivalently use log(k_new / k_old), but
using ranks self-corrects for most frequency variations, which is really
nice.
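
For instance, an item that climbs from rank 200 in the old period to rank 50
in the new period gets score = log(200/50) = log 4, roughly 1.4 (natural log),
while an item that falls from 50 to 200 gets roughly -1.4.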

If all this is good enough for you, then you don't need anything fancy at
all.  Just count occurrences using any technique you like, rank the top
umpteen of these (10,000 to 100,000 is plenty for most purposes).  Then
compare the scores from period to period.

One gotcha is how to handle items that are brand new.  The easiest answer is
to give them a rank just beyond anything you retain counts for.  Thus, if
you keep 10,000 counts, any item that doesn't appear is given a rank of
10,001.

Was this even what you wanted?

On Wed, Sep 23, 2009 at 9:19 PM, Palleti, Pallavi <
[email protected]> wrote:

> Hi Ted,
>
> I am especially looking for efficient algorithms for computing trends
> (similar to Google's hot-trends). The simple approach to start with is to
> compute the deviation of an item (URL/query) from its current score. I'm
> exploring whether I can do the same more efficiently and effectively.
>
> Thanks
> Pallavi
>
> -----Original Message-----
> From: Ted Dunning [mailto:[email protected]]
> Sent: Wednesday, September 23, 2009 10:29 PM
> To: [email protected]
> Subject: Re: Trending and prediction Analysis
>
> Can you say just a bit more about what you want?
>
> On Wed, Sep 23, 2009 at 9:45 AM, Palleti, Pallavi <
> [email protected]> wrote:
>
> > Hi all,
> >
> > I am currently exploring trending and prediction analysis algorithms.
> > Kindly refer any good work/paper that you have come across.
> >
> > Thanks
> > Pallavi
> >
> >
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>



-- 
Ted Dunning, CTO
DeepDyve
