Now that we've got substantial experience with topic-based PubSubHubbub, I
think it is time we began considering how we would extend the current
protocol to support content-based filtering of topics. We've casually
discussed this in the past; I suggest that we now address the subject
seriously.

In a topic-based subscription system, a subscriber receives all messages
published to any topic to which they have subscribed.

With content-based filtering, the subscriber is able to associate one or
more filters with a topic and will then only receive those messages which
are published to the topic and which match at least one of the associated
filters. Thus, given a topic based on a feed of weather-related
announcements, I might create a subscription with a filter that indicates
that I'm only interested in messages that are "Storm Alerts" and that are
intended for the New York City area. I might also subscribe to "Air
Pollution Alerts" in the same area. Given such subscriptions, I would
receive only those messages published to the topic that matched one of my
filters, even if many orders of magnitude more messages were actually
published to the topic.
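
To make the "match at least one filter" rule concrete, here is a rough
Python sketch. Since the filter syntax is still to be discussed, filters
are modeled simply as Python predicates over a message dictionary; the
field names ("category", "area") and the function name are illustrative
assumptions only, not part of the proposal.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]
Filter = Callable[[Message], bool]


def deliverable(messages: List[Message], filters: List[Filter]) -> List[Message]:
    """Return only the messages that match at least one of the filters."""
    return [m for m in messages if any(f(m) for f in filters)]


# Hypothetical filters for the weather example (field names are assumptions).
def storm_nyc(m: Message) -> bool:
    return m["category"] == "Storm Alert" and m["area"] == "New York City"


def air_nyc(m: Message) -> bool:
    return m["category"] == "Air Pollution Alert" and m["area"] == "New York City"


published = [
    {"category": "Storm Alert", "area": "New York City"},
    {"category": "Storm Alert", "area": "Boston"},
    {"category": "Air Pollution Alert", "area": "New York City"},
    {"category": "Frost Warning", "area": "New York City"},
]

# Only the first and third messages match one of the two filters.
received = deliverable(published, [storm_nyc, air_nyc])
```
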
I would like to suggest the following extensions to support content-based
filtering:
- In "6.1. Subscriber Sends Subscription Request": Add an optional
"hub.filter" parameter containing a textual filter expression, to be
specified in mode "subscribe" (syntax to be discussed later).
- In "6.1. Subscriber Sends Subscription Request": Add an optional
"hub.filterid" parameter, to be used in mode "unsubscribe".
- In "6.1.2 Subscription Response Details": The hub's response should
include the header field X-Hub-FilterId which would contain a unique id for
the filter. I believe we should require that this be an MD5 hash of the
query.
- In "6.2. Hub Verifies Intent of the Subscriber": Include hub.filter in
the request and provide X-Hub-FilterId in the header.
- In "7.3. Content Distribution": For each subscriber callback URL that has
registered filters on a topic, a single POST would be generated, as long
as one or more of those filters matched the published content. Each POST
request from the hub to the subscriber's callback URL would carry an
X-Hub-FilterId field listing the filter-ids of all filters that matched
the published data.
Also, when processing feeds which contain more than one entry or item,
the feeds should be "broken up" so that content-based matching, and thus
distribution, is done against only one entry or item at a time.
- In "2. Definitions": Add definitions for "filter" and "filter-id".
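
As a rough illustration of the subscribe and unsubscribe changes above,
the following Python sketch builds the form-encoded request bodies and
computes a FilterId. It assumes the FilterId is the hex MD5 digest of the
UTF-8 filter text (the encoding and the hex form are my assumptions), and
the filter text shown is only a placeholder, since the syntax is not yet
defined. The function names and example URLs are hypothetical.

```python
import hashlib
from urllib.parse import urlencode


def filter_id(filter_text: str) -> str:
    """Sketch of the proposed FilterId: hex MD5 of the filter text (UTF-8 assumed)."""
    return hashlib.md5(filter_text.encode("utf-8")).hexdigest()


def subscribe_body(callback: str, topic: str, filter_text: str) -> str:
    """Form-encoded subscription request carrying the proposed hub.filter."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.callback": callback,
        "hub.topic": topic,
        "hub.verify": "sync",
        "hub.filter": filter_text,  # proposed new parameter
    })


def unsubscribe_body(callback: str, topic: str, fid: str) -> str:
    """Form-encoded unsubscribe request carrying the proposed hub.filterid."""
    return urlencode({
        "hub.mode": "unsubscribe",
        "hub.callback": callback,
        "hub.topic": topic,
        "hub.verify": "sync",
        "hub.filterid": fid,  # proposed new parameter; the filter text is not resent
    })


# Placeholder filter text; the real syntax is still to be discussed.
example_filter = 'category:"Storm Alert" area:"New York City"'
example_id = filter_id(example_filter)
sub = subscribe_body("http://subscriber.example/cb",
                     "http://publisher.example/weather", example_filter)
unsub = unsubscribe_body("http://subscriber.example/cb",
                         "http://publisher.example/weather", example_id)
```

Note that only the subscribe request carries the filter text; the
unsubscribe request identifies the filter by its id alone.
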
Q&A: (In which I try to anticipate the obvious questions...)
- Why not put the filter query in the Topic URL and eliminate the
hub.filter?
- Any attempt to use the Topic URL would put constraints on those
URLs.
- We already have PSHB subscriptions to URLs which contain query
parameters and thus it would be hard to distinguish PSHB filter
parameters from query parameters.
- Using a distinct URL for each subscription would result in a
distinct delivery for each matching message. Thus, if you had thousands
of distinct filters on a single topic, you would receive thousands of
deliveries.
- Embedding the filter in the query will inevitably lead to it being
transmitted over the network repeatedly as well as being logged in
various log files. However, filter queries often contain very sensitive
information whose disclosure could compromise either privacy or security.
Not including it in the URL somewhat reduces the opportunity for
inadvertently disclosing or logging filter queries.
- Why use FilterId instead of just using the actual query?
- In many cases, users' filters will expose data that is private or
sensitive. If filters are frequently transmitted between machines, they
are more likely to be intercepted by malicious or unintended
intermediaries. Privacy and security are enhanced by only passing the
actual filter text when it is absolutely necessary.
- Using FilterId for delivery and for unsubscribe means that the
actual query is never exposed except during the initial subscription
exchange. Thus, it is possible to implement a "secure" system for
creating subscriptions and a somewhat less secure system for distributing
the results.
- While queries will initially probably be very simple (simple
keywords), we can expect that queries will become more complex over time.
Thus, the length of a query can become very large. Using a FilterId makes
handling these things much simpler.
- While using the query would simplify debugging, the reality is that
running systems will be more easily coded if they can rely on the ids of
matched filters being of fixed size and numeric -- to facilitate use as
database keys, etc.
- Why list all FilterIds with the delivered data?
- Doing so makes it possible to deliver a single response that covers
a potentially large number of filters. A single feed update might, for a
shared subscriber, match tens, hundreds, or even thousands of filters. In
the case of many filter matches, delivering a unique response for each
filter would waste vast amounts of network and processing resources.
- Note: Subscribers who want a distinct response for each matched
filter would just subscribe each filter with a distinct callback URL.
- Why use an MD5 hash for the FilterId?
- MD5 is commonly available on all platforms and in all useful
languages and it is easy and cheap to compute.
- If users are free to provide their own FilterIds, they will
inevitably abuse the system and use very long ids that tax the system.
- MD5 hashes are of known sizes and properties and are system
independent. Thus, using them makes it easier to design, implement, and
deploy across systems, develop open-source implementations, etc.
- Why not have the subscriber pass the FilterId to the hub?
- If the FilterId is a simple MD5 hash, it is not necessary for the
subscriber to pass it in.
- If the subscriber can pass in the FilterId, then they will
inevitably abuse this -- as discussed above.
- Why return the FilterId as a header on the Subscription Response?
- To ensure that both Server and Subscriber agree on the FilterId.
- To accommodate those coders who can't figure out how to compute MD5
hashes... (i.e. don't compute it, just pull it out of the response.)
- Why use hub.filterid for unsubscribing instead of hub.filter?
- In some cases, the subscriber will have "forgotten" the mapping from
filter to FilterId but may still be getting results for a previous
subscription. In order to unsubscribe, they need to provide the FilterId
to unsubscribe and thus stop the unexpected messages.
- Why not specify the filter syntax now?
- I believe the filter syntax discussion is orthogonal to the
discussion of the framework for registering subscriptions and for getting
responses.
- Ideally, folk would use some reasonable subset of XPath filtering.
However, that might be overwhelming for some.
- What about firehose feeds?
- I'm suggesting that this extension apply to all feeds, not just to
the still-undefined firehose feeds, even though content-based filtering
will probably be most useful when, in fact, used with firehose feeds. If
we assume that a firehose feed is identified by a topic (perhaps a
".well-known" topic...) then what we define for normal topics should also
apply to firehose feeds.
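
To illustrate the delivery behavior discussed above (one POST per
callback URL, listing every matched FilterId, with feeds matched one
entry at a time), here is a rough Python sketch of how a hub might plan
its deliveries. The in-memory layout, the comma-separated header value,
the short ids "f1" and "f2" (real ids would be MD5 hashes), and the
predicate filters are all assumptions made for the sake of the example.

```python
from typing import Callable, Dict, List, Tuple

Entry = Dict[str, str]
Predicate = Callable[[Entry], bool]
# Assumed layout: callback URL -> {filter id -> predicate}.
Subscriptions = Dict[str, Dict[str, Predicate]]
Delivery = Tuple[str, Dict[str, str], Entry]


def plan_deliveries(entries: List[Entry], subs: Subscriptions) -> List[Delivery]:
    """Return (callback, headers, entry) triples, at most one per callback per entry."""
    deliveries: List[Delivery] = []
    for entry in entries:  # the feed is "broken up": one entry at a time
        for callback, filters in subs.items():
            matched = [fid for fid, pred in filters.items() if pred(entry)]
            if matched:
                # A single POST lists all matching filter ids.
                headers = {"X-Hub-FilterId": ", ".join(matched)}
                deliveries.append((callback, headers, entry))
    return deliveries


subs: Subscriptions = {
    "http://subscriber.example/cb": {
        "f1": lambda e: "storm" in e["title"].lower(),
        "f2": lambda e: "new york" in e["title"].lower(),
    },
}
entries = [
    {"title": "Storm warning for New York"},
    {"title": "Sunny in Boston"},
]

# The first entry matches both filters and yields one delivery;
# the second entry matches neither and yields none.
deliveries = plan_deliveries(entries, subs)
```

A subscriber wanting one response per filter would, as noted above,
simply register each filter under its own callback URL, which in this
sketch means a separate key in the subscriptions mapping.
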
I would appreciate any comments that you might have. I think that the
proposal here is about as simple as it can get without introducing serious
bandwidth or processing loads on servers and/or subscribers.
bob wyman