Apache Wiki
Fri, 11 Jan 2008 07:51:26 -0800
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.
The following page has been changed by udanax: http://wiki.apache.org/lucene-hadoop/NewsPersonalizationSystem New page: [[TableOfContents(4)]] ---- = News Personalization System = == Initial Contributors == * [:udanax:Edward Yoon] (R&D center, NHN corp.) == Algorithm Overview == * Obtain a list of candidate stories * For each story: * Obtain 3 scores (y1, y2, y3) * Final score = â wi*yi * User clustering algorithms (Minhash (y1), PLSI (y2)) * Score ~ number of times this story was clicked by other users in your cluster * Story-story covisitation (y3) * Score ~ number of times this story was co-clicked with other stories in your recent click history * User clustering done offline as batch process * Can be run every day or 2-3 times a week * Cluster-story counts maintained in real time * New users * May not be clustered * Rely on co-visitation to generate recommendations ---- == NPS Architecture == {{{ +----------------------------+ | News Frontend Web server | +-----+---------------+------+ | â | Rank Ranks Click Request Reply Notify â | â +--------------------------+--------------------+ | Personalization Servers | +----------------------------------------+------+ â â | Reads user profile Read Stats Update Stats | | â +-------------------------+-----+ +------------+-------------------+ | Hbase:UserTable | | Hbase:StoryTable | | (user clusters, click history | | (Cluster + covisitation count) | +-------------------------------+ +--------------------------------+ â +-----------------------------+-------------------------------------------+ | Hadoop:User clustering MapReduce | +-------------------------------------------------------------------------+ }}} ---- == Clustering Algorithms == === User clustering - MinHash === * Input: User and his clicked stories * ~+S+~,,u,, = {s^u^,,1,, , s^u^,,2,, , ... , s^u^,,m} * User similarity = | S,,u1,, I S,,u2 | / |S,,u1, Y S,,u2,, | * Output: User clusters. * Similar users belong to same cluster ==== MinHash ==== * Randomly permute the universe of clicked stories * {s^u^,,1,, , s^u^,,2,, , ... , s^u^,,m} = {s^'^,,1,, , s^'^,,2,, , ... , s^'^,,m} * MH(u) = min(s^u^,,j,,) min defined by permutation * P{MH(u,,1,,) = MH(u,,2,,)} = | S,,u1,, I S,,u2 | / |S,,u1, Y S,,u2,, | * Pseudo-random permutation * Compute hash for each story and treat hash-value as permutation index * Treat !MinHash value as !ClusterId * Probabilistic clustering === Clustering - PLSI Algorithm === * Learning (done offline) * ML estimation * Learn model parameters that maximize the likelihood of the sample data * Ouput = P[zj|u]âs P[s|zj]âs * P[zj|u]âs lead to a soft clustering of users * Runtime: we only use P[zj|u]âs and ignore P[s|zj]âs === Covisitation count === * For each story si store the covisitation counts with other stories c(si, sj ) * Candidate story: sk * User history: s1,â¦, sn * score (si, sj ) = c(si, sj )/âm c(si, sm ) * total_score(sk) = ân score(sk, sn ) ---- == References == * Google News Personalization: Scalable Online Collaborative Filtering * Bigtable paper: OSDI 2006