kragen-tol  

URL history Bloom filters for collaborative filtering

kragen
Wed, 20 Oct 2004 00:40:41 -0700

This is a snapshot of 
<http://wiki.commerce.net/tiki-index.php?page=UrlHistoryBloomFilters>.

(A bit rough and ready; sorry!)

Social recommendation networks
------------------------------

Often I'm interested in the same web pages my friends read.
Sometimes, reading their blogs tells me what they read;
http://del.icio.us/inbox/kragen helps, too.  However, those approaches
take some work per page: you have to explicitly post a link to each
page to tell me you liked it.  It's nearly the same amount of work as
telling me about cool web pages when we're hanging out at the water
cooler.  So we never publish links to the majority of pages we visit,
so our friends never find out we liked them.

Collaborative filtering: statistical recommendations
----------------------------------------------------

Several services, such as Alexa's "what's related" service
<http://wp.netscape.com/escapes/related/faq.html> and Amazon's
"related books" service, take advantage of a central point of control
in a system to track many people's reading.  Then, they predict what I
will be interested in, from statistical profiles of people like me.
These services provide many of the benefits of social recommendation
networks, but they don't require as much effort on the part of the
users.  They are known as collaborative filtering
<http://pespmc1.vub.ac.be/COLLFILT.html> systems.

Cobrowsing: publishing your URL history
---------------------------------------

In 1997, in a web page entitled Communal Web Browsing
<http://pobox.com/~kragen/community-web.html>, I suggested that you
could discourage children from browsing porn sites by just publishing
a list of what they were browsing in real time where their parents or
other community members could watch; this avoids the problems Solid
Oak Software has so vividly illustrated with censorware
<http://www.peacefire.org/archives/latimes.on.cybersitter.txt>.
Off-the-shelf software such as VNC <http://www.tightvnc.com/>,
BackOrifice <http://www.bo2k.com/>, and Groove
<http://www.groove.net/> support sharing of a web browser.  Most
Americans, however, would prefer not to publicize their web-browsing
habits indiscriminately.  Even public-access computer terminals in
public libraries commonly have some sort of privacy shield on the
screen to restrict viewing to the person sitting at the machine.

If I were to build a piece of software that published every URL I
visited on the Web, my friends could easily see which pages I had
visited, with no extra effort on my part.  If my friends also
installed this software, I could see which pages had been most
frequently visited by my friends --- presumably the most popular ones
would also be of interest to me.  My friends might not be willing to
install that software, though, because they might prefer a greater
measure of privacy, such as that granted by software that requires an
explicit action to publish each link.

Bloom filters
-------------

There are these things called Bloom filters
<http://www.cap-lore.com/code/BloomTheory.html>; they're basically
compact, lossy representations of sets of hash values.  You feed a
Bloom filter a hash, and it returns "yes" or "no".  "Yes" means that
the set it represents it may or may not contain the hash value; "no"
means that it definitely does not.  The probability of a "yes" for a
value it doesn't contain is adjustable, and it's lower for larger
Bloom filters.

Schachter and Ceglowski's LOAF <http://loaf.cantbedone.org/>, a pun on
FOAF <http://xmlns.com/foaf/0.1/>, uses Bloom filters to help you
prioritize your email by letting you recognize which incoming email
comes from your friends' correspondents --- with an adjustable
probability of a false positive.  If that probability were 50%, then
you'd expect a new correspondent to appear to be a correspondent of
about 10 out of 20 friends whose LOAF files you have.  The probability
is about 1% that, by chance, they will appear to be a correspondent of
more than 15, and only about 5% that they will appear to be a
correspondent of more than 13.  But if they actually are a
correspondent of 7 out of of your 20 friends, the probability is about
50% that they will appear to be a correspondent of more than 13.

As the number of your LOAF-using correspondents increases, you can
reliably detect a much smaller proportion of correspondents.

This protects the privacy of your friends' address books, since you
can't tell for sure whether any particular friend has an address in
their address book --- because of the false positive probability.  But
it still gives you an aggregate answer about your community's
familiarity with any particular email address.

Decentralized collaborative filtering: send your friends Bloom filters of URLs
------------------------------------------------------------------------------

Suppose we apply this to URLs instead.  Each day I'll publish an
updated Bloom filter containing all the URLs I've seen in the last few
weeks, with a relatively high false-positive rate.  Every few weeks,
I'll clear it out, and new URLs will accumulate for a few weeks.  Now
all my friends have a hint, whenever they see a URL, whether or not I
also saw it.  They can preferentially follow links that several of
their friends thought were interesting.

Because I actually only visit a few hundred URLs per week, even a
relatively low false-positive rate suffices to protect my privacy.  If
there are four billion URLs on the public web, and my Bloom filter's
false-positive rate is 0.1%, then four million URLs match the filter
without me having visited them (in the last few weeks), while only a
few hundred match it because I have visited them.  So the fact that a
particular URL matches my filter is pretty weak evidence that I have
actually visited it.  Practically speaking, a false-positive rate
closer to 10% might be better, since some URLs are much more popular
than others, so the prior probability might be high enough to matter.

A URL-recommending engine can accumulate these different Bloom filters
over time and compare them against my browser history to see which
ones are most closely correlated to it, in order to figure out which
ones are the best predictors of where I want to go next.  A naive
Bayesian reasoner could simply compute the probability, given that a
URL does or does not appear in a particular Bloom filter, that it
appears in my history; then it can adjust all these probabilities
together to come up with a predicted interest level.  A smarter engine
might discover pairwise or even more complex dependencies between the
filters and discount those results --- if Adam and Rohit always go to
the same URLs, I'd like to not count their two filters as if they were
independent.

The question remains how to acquire URLs to recommend in the first
place.  You can't simply pull them out of the filters --- that's the
point.  You could look at the stream of links in HTML pages I look at,
or the stream of links in a bunch of RSS feeds, or the output of
Technorati, del.icio.us, Syndic8, or pubsub.com, or my email, IRC, and
IM channels.

Then you have the question of how to present the results.  You could
display them on an HTML page, sorted with the most relevant links
first, or you could rewrite HTML pages on the way in to my browser
with icons or colors representing popularity.

Related topics
---------------

Other useful things to do with a big pile of URL Bloom filters:
* given a seed URL and a list of candidate URLs, order the candidate
  URLs according to their co-occurrence with the seed URL.
* compute the pairwise correlations of the filters, perhaps with a
  list of URLs and perhaps without.
* cluster a list of URLs into clusters that tend to co-occur in the
  same filters --- for example, to find the "principal components" of a
  pile of search engine results, so you can present a few URLs from each
  "component".
  • URL history Bloom filters for collaborative filtering kragen