kragen
Wed, 20 Oct 2004 00:40:41 -0700
This is a snapshot of <http://wiki.commerce.net/tiki-index.php?page=UrlHistoryBloomFilters>. (A bit rough and ready; sorry!) Social recommendation networks ------------------------------ Often I'm interested in the same web pages my friends read. Sometimes, reading their blogs tells me what they read; http://del.icio.us/inbox/kragen helps, too. However, those approaches take some work per page: you have to explicitly post a link to each page to tell me you liked it. It's nearly the same amount of work as telling me about cool web pages when we're hanging out at the water cooler. So we never publish links to the majority of pages we visit, so our friends never find out we liked them. Collaborative filtering: statistical recommendations ---------------------------------------------------- Several services, such as Alexa's "what's related" service <http://wp.netscape.com/escapes/related/faq.html> and Amazon's "related books" service, take advantage of a central point of control in a system to track many people's reading. Then, they predict what I will be interested in, from statistical profiles of people like me. These services provide many of the benefits of social recommendation networks, but they don't require as much effort on the part of the users. They are known as collaborative filtering <http://pespmc1.vub.ac.be/COLLFILT.html> systems. Cobrowsing: publishing your URL history --------------------------------------- In 1997, in a web page entitled Communal Web Browsing <http://pobox.com/~kragen/community-web.html>, I suggested that you could discourage children from browsing porn sites by just publishing a list of what they were browsing in real time where their parents or other community members could watch; this avoids the problems Solid Oak Software has so vividly illustrated with censorware <http://www.peacefire.org/archives/latimes.on.cybersitter.txt>. Off-the-shelf software such as VNC <http://www.tightvnc.com/>, BackOrifice <http://www.bo2k.com/>, and Groove <http://www.groove.net/> support sharing of a web browser. Most Americans, however, would prefer not to publicize their web-browsing habits indiscriminately. Even public-access computer terminals in public libraries commonly have some sort of privacy shield on the screen to restrict viewing to the person sitting at the machine. If I were to build a piece of software that published every URL I visited on the Web, my friends could easily see which pages I had visited, with no extra effort on my part. If my friends also installed this software, I could see which pages had been most frequently visited by my friends --- presumably the most popular ones would also be of interest to me. My friends might not be willing to install that software, though, because they might prefer a greater measure of privacy, such as that granted by software that requires an explicit action to publish each link. Bloom filters ------------- There are these things called Bloom filters <http://www.cap-lore.com/code/BloomTheory.html>; they're basically compact, lossy representations of sets of hash values. You feed a Bloom filter a hash, and it returns "yes" or "no". "Yes" means that the set it represents it may or may not contain the hash value; "no" means that it definitely does not. The probability of a "yes" for a value it doesn't contain is adjustable, and it's lower for larger Bloom filters. Schachter and Ceglowski's LOAF <http://loaf.cantbedone.org/>, a pun on FOAF <http://xmlns.com/foaf/0.1/>, uses Bloom filters to help you prioritize your email by letting you recognize which incoming email comes from your friends' correspondents --- with an adjustable probability of a false positive. If that probability were 50%, then you'd expect a new correspondent to appear to be a correspondent of about 10 out of 20 friends whose LOAF files you have. The probability is about 1% that, by chance, they will appear to be a correspondent of more than 15, and only about 5% that they will appear to be a correspondent of more than 13. But if they actually are a correspondent of 7 out of of your 20 friends, the probability is about 50% that they will appear to be a correspondent of more than 13. As the number of your LOAF-using correspondents increases, you can reliably detect a much smaller proportion of correspondents. This protects the privacy of your friends' address books, since you can't tell for sure whether any particular friend has an address in their address book --- because of the false positive probability. But it still gives you an aggregate answer about your community's familiarity with any particular email address. Decentralized collaborative filtering: send your friends Bloom filters of URLs ------------------------------------------------------------------------------ Suppose we apply this to URLs instead. Each day I'll publish an updated Bloom filter containing all the URLs I've seen in the last few weeks, with a relatively high false-positive rate. Every few weeks, I'll clear it out, and new URLs will accumulate for a few weeks. Now all my friends have a hint, whenever they see a URL, whether or not I also saw it. They can preferentially follow links that several of their friends thought were interesting. Because I actually only visit a few hundred URLs per week, even a relatively low false-positive rate suffices to protect my privacy. If there are four billion URLs on the public web, and my Bloom filter's false-positive rate is 0.1%, then four million URLs match the filter without me having visited them (in the last few weeks), while only a few hundred match it because I have visited them. So the fact that a particular URL matches my filter is pretty weak evidence that I have actually visited it. Practically speaking, a false-positive rate closer to 10% might be better, since some URLs are much more popular than others, so the prior probability might be high enough to matter. A URL-recommending engine can accumulate these different Bloom filters over time and compare them against my browser history to see which ones are most closely correlated to it, in order to figure out which ones are the best predictors of where I want to go next. A naive Bayesian reasoner could simply compute the probability, given that a URL does or does not appear in a particular Bloom filter, that it appears in my history; then it can adjust all these probabilities together to come up with a predicted interest level. A smarter engine might discover pairwise or even more complex dependencies between the filters and discount those results --- if Adam and Rohit always go to the same URLs, I'd like to not count their two filters as if they were independent. The question remains how to acquire URLs to recommend in the first place. You can't simply pull them out of the filters --- that's the point. You could look at the stream of links in HTML pages I look at, or the stream of links in a bunch of RSS feeds, or the output of Technorati, del.icio.us, Syndic8, or pubsub.com, or my email, IRC, and IM channels. Then you have the question of how to present the results. You could display them on an HTML page, sorted with the most relevant links first, or you could rewrite HTML pages on the way in to my browser with icons or colors representing popularity. Related topics --------------- Other useful things to do with a big pile of URL Bloom filters: * given a seed URL and a list of candidate URLs, order the candidate URLs according to their co-occurrence with the seed URL. * compute the pairwise correlations of the filters, perhaps with a list of URLs and perhaps without. * cluster a list of URLs into clusters that tend to co-occur in the same filters --- for example, to find the "principal components" of a pile of search engine results, so you can present a few URLs from each "component".