Another, more traditional approach is to group by user id and sort by time. Then you can slide through a single user's transactions, emitting pairs of items that occur in the same window. Windowed co-occurrence is a bit of a strange beast because it isn't transitive (A can co-occur with B and B with C without A co-occurring with C).
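
To make the group/sort/slide idea concrete, here is a rough sketch in plain Python. It is only an illustration, not how Mahout or PFPGrowth does it. The function name `windowed_pairs`, the `(user, timestamp, item)` tuple layout, and timestamps measured in seconds are all assumptions made for the example.

```python
from collections import defaultdict, deque


def windowed_pairs(events, window_seconds=300):
    """Yield (user, earlier_item, later_item) for each pair of distinct items
    that the same user touched at most window_seconds apart.

    `events` is assumed to be an iterable of (user, timestamp, item) tuples,
    with timestamps as numbers in seconds (a hypothetical layout).
    """
    # Group by user id.
    by_user = defaultdict(list)
    for user, ts, item in events:
        by_user[user].append((ts, item))

    for user, rows in by_user.items():
        # Sort each user's transactions by time.
        rows.sort(key=lambda row: row[0])
        recent = deque()  # items still inside the sliding window
        for ts, item in rows:
            # Drop items that have fallen out of the window.
            while recent and ts - recent[0][0] > window_seconds:
                recent.popleft()
            # Pair the new item with everything still in the window.
            for _, earlier in recent:
                if earlier != item:
                    yield user, earlier, item
            recent.append((ts, item))


# Hypothetical data: A at 0s, B at 200s, C at 400s, with a 300s window.
events = [("u1", 0, "A"), ("u1", 200, "B"), ("u1", 400, "C")]
pairs = list(windowed_pairs(events, window_seconds=300))
```

With this sample data, A pairs with B and B pairs with C, but A and C are 400 seconds apart, so no A/C pair is emitted. That is the non-transitivity mentioned above.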
The problem with what you propose is that users are likely to come in for about 5 minutes at a time. Using 5-minute windows that don't slide will substantially decrease the number of co-occurrences, since a visit will often straddle a window boundary and get split in two. It should also work well to use a very large window such as 2 hours and slide using that, or, in the extreme, to just group on user and ignore time. The drawback of the extreme solutions is that the downstream algorithms have to be better at handling more data (potentially roughly quadratic in window size if all users are active all the time) and better at handling noise due to attention span issues.

On Sun, Aug 2, 2009 at 3:51 AM, Robin Anil <[email protected]> wrote:

> What I am thinking is given a 5 minute window in time for a given user,
> group all the queries (if they are unique choose only one) and call that as
> a transaction for PFPGrowth.

--
Ted Dunning, CTO
DeepDyve
