Now that we've got substantial experience with topic-based PubSubHubbub, I
think it is time we began considering how we would extend the current
protocol to support content-based filtering of topics. We've casually
discussed this in the past; I suggest that we now address the subject
seriously.

In a topic-based subscription system, a subscriber receives all messages
published to any topic to which they have subscribed.
With content-based filtering, the subscriber is able to associate one or
more filters with a topic and will then only receive those messages which
are published to the topic and which match at least one of the associated
filters. Thus, given a topic based on a feed of weather-related
announcements, I might create a subscription with a filter that indicates
that I'm only interested in messages that are "Storm Alerts" and that are
intended for the New York City area. I might also subscribe to "Air
Pollution Alerts" in the same area. Given such subscriptions, I would only
receive those messages published to the topic that matched one of my filters
even if there were many orders of magnitude more messages actually published
to the topic.
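
A minimal sketch of the matching rule described above, in Python. The filter
representation here (a dict of required field values) and the field names
"category" and "area" are purely illustrative assumptions; the actual filter
syntax is still to be discussed.

```python
# Hypothetical filter representation: the proposal leaves the filter syntax
# open, so each filter here is simply a dict of field -> required value.
def matches(filter_spec, message):
    """Return True when every field named by the filter equals the message's value."""
    return all(message.get(field) == value for field, value in filter_spec.items())


def should_deliver(filters, message):
    """Content-based rule: deliver when at least one associated filter matches."""
    return any(matches(f, message) for f in filters)


# The two subscriptions from the weather example.
my_filters = [
    {"category": "Storm Alert", "area": "New York City"},
    {"category": "Air Pollution Alert", "area": "New York City"},
]

# Everything published to the topic (illustrative data only).
published = [
    {"category": "Storm Alert", "area": "New York City"},
    {"category": "Storm Alert", "area": "Boston"},
    {"category": "Air Pollution Alert", "area": "New York City"},
    {"category": "Pollen Count", "area": "New York City"},
]

# Only the messages that match at least one filter reach the subscriber.
delivered = [m for m in published if should_deliver(my_filters, m)]
```

Here only the two New York City alerts would be delivered, however many other
messages were published to the topic.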

I would like to suggest the following extensions to support content-based
filtering:

   - In "6.1. Subscriber Sends Subscription Request": add an optional
   "hub.filter" parameter, containing a textual filter expression, to be
   specified in mode "subscribe" (syntax to be discussed later).
   - In "6.1. Subscriber Sends Subscription Request": add an optional
   "hub.filterid" parameter, which would be used in mode "unsubscribe".
   - In "6.1.2 Subscription Response Details": The hub's response should
   include the header field X-Hub-FilterId which would contain a unique id for
   the filter. I believe we should require that this be an MD5 hash of the
   query.
   - In "6.2 Hub Verifies Intent of the Subscriber": include hub.filter in
   the request and provide X-Hub-FilterId in the header.
   - In "7.3 Content Distribution" For each subscriber/callback-url that had
   registered a filter on a topic, a single POST would be generated -- as long
   as one or more of those filters matched published content. In each POST
   request from hub to subscriber's callback URL, there would be an
   X-Hub-FilterId field that lists all of the filter-ids for filters that match
   the published data.
   Also, when processing feeds which contain more than one entry or item,
   the feeds should be "broken up" so that content-based matching and thus
   distribution is done against only one entry or item at a time.
   - In "2. Definitions": add definitions for "filter" and "filter-id".
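
To make the subscribe/unsubscribe extensions above concrete, here is a rough
Python sketch of the request bodies a subscriber might send. The hex encoding
of the MD5 hash, the UTF-8 encoding of the filter text, the helper names, the
example URLs, and the filter expression itself are all assumptions for
illustration; only "hub.filter", "hub.filterid" and the MD5 requirement come
from the proposal.

```python
import hashlib
from urllib.parse import urlencode


def filter_id(filter_text):
    """FilterId as proposed: an MD5 hash of the filter text.

    Hex output and UTF-8 input encoding are assumptions, not part of the proposal.
    """
    return hashlib.md5(filter_text.encode("utf-8")).hexdigest()


def subscribe_body(callback, topic, filter_text):
    """Form-encoded body for a subscribe request carrying the proposed hub.filter."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.callback": callback,
        "hub.topic": topic,
        "hub.verify": "sync",       # existing parameter from the current spec
        "hub.filter": filter_text,  # proposed extension
    })


def unsubscribe_body(callback, topic, fid):
    """Form-encoded body for an unsubscribe request that names only the FilterId."""
    return urlencode({
        "hub.mode": "unsubscribe",
        "hub.callback": callback,
        "hub.topic": topic,
        "hub.verify": "sync",
        "hub.filterid": fid,        # proposed extension; the filter text is not resent
    })


# Placeholder filter expression; the real syntax is still to be decided.
example_filter = 'category="Storm Alert" and area="New York City"'
fid = filter_id(example_filter)

sub = subscribe_body("https://subscriber.example/callback",
                     "https://weather.example/feed", example_filter)
unsub = unsubscribe_body("https://subscriber.example/callback",
                         "https://weather.example/feed", fid)
```

A subscriber that prefers not to compute the hash could instead read the value
from the X-Hub-FilterId header of the hub's subscription response.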


Q&A: (In which I try to anticipate the obvious questions...)

   - Why not put the filter query in the Topic URL and eliminate the
   hub.filter?
      - Any attempt to use the Topic URL would put constraints on those
      URLs.
      - We already have PSHB subscriptions to URLs which contain query
      parameters and thus it would be hard to distinguish PSHB filter
      parameters from query parameters.
      - Using a distinct URL for each subscription would result in a
      distinct delivery for each matching message. Thus, if you had
      thousands of distinct filters on a single topic, you would receive
      thousands of deliveries.
      - Embedding the filter in the query will inevitably lead to it being
      transmitted over the network repeatedly as well as being logged in
      various log files. However, filter queries often contain very
      sensitive information whose disclosure could compromise either
      privacy or security. Not including it in the URL somewhat reduces the
      opportunity for inadvertently disclosing or logging filter queries.
   - Why use FilterId instead of just using the actual query?
      - In many cases, users' filters will expose data that is private or
      sensitive. If filters are frequently transmitted between machines,
      they are more likely to be intercepted by malicious or unintended
      intermediaries. Privacy and security are enhanced by only passing the
      actual filter text when it is absolutely necessary.
      - Using FilterId for delivery and for unsubscribe means that the
      actual query is never exposed except during the initial subscription
      exchange. Thus, it is possible to implement a "secure" system for
      creating subscriptions and a somewhat less secure system for
      distributing the results.
      - While queries will initially probably be very simple (simple
      keywords), we can expect that queries will become more complex over time.
      Thus, the length of a query can become very large. Using a FilterId makes
      handling these things much simpler.
      - While using the query would simplify debugging, the reality is that
      running systems will be more easily coded if they can rely on the ids of
      matched filters being of fixed size and numeric -- to facilitate use as
      database keys, etc.
   - Why list all FilterIds with the delivered data?
      - Doing so makes it possible to deliver a single response that covers
      a potentially large number of filters. A single feed update might, for
      a shared subscriber, match tens, hundreds, or even thousands of
      filters. In the case of many filter matches, delivering a unique
      response for each filter would waste vast amounts of network and
      processing resources.
      - Note: Subscribers who want a distinct response for each matched
      filter would just subscribe each filter with a distinct callback URL.
   - Why use an MD5 hash for the FilterId?
      - MD5 is commonly available on all platforms and in all useful
      languages and it is easy and cheap to compute.
      - If users are free to provide their own FilterIds, they will
      inevitably abuse the system and use very long ids that tax the system.
      - MD5 hashes are of known sizes and properties and are system
      independent. Thus, using them makes it easier to design, implement and
      deploy across systems, develop open-source implementations, etc.
   - Why not have the subscriber pass the FilterId to the hub?
      - If the FilterId is a simple MD5 hash, it is not necessary for the
      subscriber to pass it in.
      - If the subscriber can pass in the FilterId, then they will
      inevitably abuse this -- as discussed above.
   - Why return the FilterId as a header on the Subscription Response?
      - To ensure that both Server and Subscriber agree on the FilterId.
      - To accommodate those coders who can't figure out how to compute MD5
      hashes... (i.e. don't compute it, just pull it out of the response.)
   - Why use hub.filterid for unsubscribing instead of hub.filter?
      - In some cases, the subscriber will have "forgotten" the mapping from
      filter to FilterId but may still be getting results for a previous
      subscription. Each delivery carries the FilterId, so they need only
      provide that FilterId to unsubscribe and thus stop the unexpected
      messages.
   - Why not specify the filter syntax now?
      - I believe the filter syntax discussion is orthogonal to the
      discussion of the framework for registering subscriptions and for getting
      responses.
      - Ideally, folk would use some reasonable subset of XPath filtering.
      However, that might be overwhelming for some.
   - What about firehose feeds?
      - I'm suggesting that this extension apply to all feeds, not just to
      the still-undefined firehose feeds, even though content-based
      filtering will probably be most useful when, in fact, used with
      firehose feeds. If we assume that a firehose feed is identified by a
      topic (perhaps a ".well-known" topic...), then what we define for
      normal topics should also apply to firehose feeds.
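
As a rough illustration of the delivery behaviour argued for above (one POST
per callback URL, listing every matching FilterId), here is a Python sketch of
how a hub might plan deliveries. The function names, the comma-separated
header format, the keyword-style matcher, and the example URLs are assumptions
of mine; the proposal specifies neither a list separator nor a filter syntax.

```python
import hashlib
from collections import defaultdict


def filter_id(filter_text):
    """MD5-based FilterId (hex encoding is an assumption)."""
    return hashlib.md5(filter_text.encode("utf-8")).hexdigest()


def plan_deliveries(entries, subscriptions, matches):
    """Return one (callback, entry, headers) tuple per callback per matching entry.

    subscriptions: iterable of (callback_url, filter_text) pairs for one topic.
    matches: callable(filter_text, entry) -> bool, standing in for the
             still-undefined filter syntax.
    """
    # Group all filters registered by the same callback URL.
    by_callback = defaultdict(list)
    for callback, filter_text in subscriptions:
        by_callback[callback].append(filter_text)

    deliveries = []
    for entry in entries:  # the feed is "broken up": match one entry at a time
        for callback, filter_texts in by_callback.items():
            matched = [filter_id(t) for t in filter_texts if matches(t, entry)]
            if matched:
                # A single POST lists every matching FilterId (separator assumed).
                headers = {"X-Hub-FilterId": ", ".join(matched)}
                deliveries.append((callback, entry, headers))
    return deliveries


def keyword_match(filter_text, entry):
    """Hypothetical matcher: case-insensitive substring test."""
    return filter_text.lower() in entry.lower()


entries = ["Storm alert for New York City", "Pollen count update"]
subscriptions = [
    ("https://subscriber.example/cb", "storm"),
    ("https://subscriber.example/cb", "new york"),
    ("https://other.example/cb", "pollen"),
]
plan = plan_deliveries(entries, subscriptions, keyword_match)
```

In this example the first entry matches two filters registered by the same
callback URL, yet it is planned as a single delivery whose header names both
FilterIds.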

I would appreciate any comments that you might have. I think that the
proposal here is about as simple as it can get without introducing serious
bandwidth or processing loads on servers and/or subscribers.

bob wyman
