Re: [pubsubhubbub] Content-Based Filtering for PubSubHubbub (A proposal)

Alexis Richardson Tue, 29 Jun 2010 15:30:31 -0700

On Tue, Jun 29, 2010 at 11:09 PM, Bob Wyman <[email protected]> wrote:
> Now that we've got substantial experience with topic-based PubSubHubbub,


With all due respect - I don't think we have enough yet.  The spec is
still unstable.

alexis



> I think it is time we began considering how we would extend the current
> protocol to support content-based filtering of topics... We've casually
> discussed this in the past; I suggest that we address the subject seriously.
> In a topic-based subscription system, a subscriber receives all messages
> published to any topic to which they have subscribed.
> With content-based filtering, the subscriber is able to associate one or
> more filters with a topic and will then only receive those messages which
> are published to the topic and which match at least one of the associated
> filters. Thus, given a topic based on a feed of weather-related
> announcements, I might create a subscription with a filter that indicates
> that I'm only interested in messages that are "Storm Alerts" and that are
> intended for the New York City area. I might also subscribe to "Air
> Pollution Alerts" in the same area. Given such subscriptions, I would only
> receive those messages published to the topic that matched one of my filters
> even if there were many orders of magnitude more messages actually published
> to the topic.
> I would like to suggest the following extensions to support content-based
> filtering:
>
> - In "6.1. Subscriber Sends Subscription Request": Add an optional
>   "hub.filter" parameter, to be specified in mode "subscribe", which would
>   contain a textual filter expression (syntax to be discussed later).
> - In "6.1 Subscriber Sends Subscription Request": Add an optional
>   "hub.filterid" parameter, which would be used in mode "unsubscribe".
> - In "6.1.2 Subscription Response Details": The hub's response should
>   include the header field X-Hub-FilterId, which would contain a unique id
>   for the filter. I believe we should require that this be an MD5 hash of
>   the query.
> - In "6.2 Hub Verifies Intent of the Subscriber": Include hub.filter in the
>   request and provide X-Hub-FilterId in the header.
> - In "7.3 Content Distribution": For each subscriber/callback-url that had
>   registered a filter on a topic, a single POST would be generated -- as
>   long as one or more of those filters matched published content. In each
>   POST request from the hub to the subscriber's callback URL, there would be
>   an X-Hub-FilterId field that lists all of the filter-ids for filters that
>   match the published data.
>   Also, when processing feeds which contain more than one entry or item, the
>   feeds should be "broken up" so that content-based matching and thus
>   distribution is done against only one entry or item at a time.
> - In "2. Definitions": Add definitions for "filter" and "filter-id".
>
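The subscriber side of the extensions listed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the proposal leaves the filter syntax open, so the filter string here is a made-up keyword expression; the URLs are placeholders; and hashing the UTF-8 bytes of the filter and using the hex digest is one plausible reading of "an MD5 hash of the query", not something the proposal specifies.

```python
import hashlib
from urllib.parse import urlencode


def filter_id(filter_text: str) -> str:
    """MD5 hex digest of the filter text (UTF-8 encoding is an assumption)."""
    return hashlib.md5(filter_text.encode("utf-8")).hexdigest()


def subscribe_body(callback: str, topic: str, filter_text: str) -> str:
    """Build the form-encoded body of a subscription request."""
    params = {
        "hub.mode": "subscribe",
        "hub.callback": callback,
        "hub.topic": topic,
        "hub.verify": "sync",
        "hub.filter": filter_text,  # proposed parameter, not part of the spec
    }
    return urlencode(params)


# Hypothetical values; the filter syntax is left open by the proposal.
body = subscribe_body(
    "https://subscriber.example/callback",
    "https://weather.example/alerts.atom",
    "storm alert new york",
)
# The id the subscriber would expect to see echoed back in X-Hub-FilterId.
expected_id = filter_id("storm alert new york")
```

Because the id is derived from the filter text alone, the subscriber can either compute it locally or simply read it from the hub's response.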
> Q&A: (In which I try to anticipate the obvious questions...)
>
> Why not put the filter query in the Topic URL and eliminate the hub.filter?
>
> Any attempt to use the Topic URL would put constraints on those URLs.
> We already have PSHB subscriptions to URLs which contain query parameters
> and thus it would be hard to distinguish PSHB filter parameters from query
> parameters.
> Using a distinct URL for each subscription would result in a distinct
> delivery for each matching message. Thus, if you had thousands of distinct
> filters on a single topic, you would receive thousands of deliveries.
> Embedding the filter in the query will inevitably lead to it being
> transmitted over the network repeatedly as well as being logged in various
> log files. However, filter queries often contain very sensitive information
> whose disclosure could compromise either privacy or security. Not including
> it in the URL somewhat reduces the opportunity for inadvertently disclosing
> or logging filter queries.
>
> Why use FilterId instead of just using the actual query?
>
> In many cases, users' filters will expose data that is private or sensitive.
> If filters are frequently transmitted between machines, they are more likely
> to be intercepted by malicious or unintended intermediaries. Privacy and
> security are enhanced by only passing the actual filter text when it is
> absolutely necessary.
> Using FilterId for delivery and for unsubscribe means that the actual query
> is never exposed except during the initial subscription exchange. Thus, it
> is possible to implement a "secure" system for creating subscriptions and a
> somewhat less secure system for distributing the results.
> While queries will initially probably be very simple (simple keywords), we
> can expect that queries will become more complex over time. Thus, the length
> of a query can become very large. Using a FilterId makes handling these
> things much simpler.
> While using the query would simplify debugging, the reality is that running
> systems will be more easily coded if they can rely on the ids of matched
> filters being of fixed size and numeric -- to facilitate use as database
> keys, etc.
>
> Why list all FilterIds with the delivered data?
>
> Doing so makes it possible to deliver a single response that covers a
> potentially large number of filters. A single feed update might, for a
> shared subscriber, match tens, hundreds, or even thousands of filters. In
> the case of many filter matches, delivering a unique response for each
> filter would waste vast amounts of network and processing resources.
> Note: Subscribers who want a distinct response for each matched filter would
> just subscribe each filter with a distinct call-back URL.
>
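A hub-side sketch of the delivery rule described above: entries are matched one at a time, and each callback gets at most one POST per entry, carrying every matching filter id. The keyword matcher is a stand-in, since the filter syntax is undecided, and the comma-separated header value is an assumption; the proposal does not say how multiple ids would be listed in X-Hub-FilterId.

```python
import hashlib


def filter_id(filter_text: str) -> str:
    """MD5 hex digest of the filter text (UTF-8 encoding is an assumption)."""
    return hashlib.md5(filter_text.encode("utf-8")).hexdigest()


def matches(filter_text: str, entry: str) -> bool:
    # Stand-in matcher: every keyword in the filter must appear in the entry.
    text = entry.lower()
    return all(word in text for word in filter_text.lower().split())


def plan_deliveries(entries, filters_by_callback):
    """Return one (callback, entry, header_value) tuple per callback per matching entry."""
    deliveries = []
    for entry in entries:  # the feed is "broken up" into single entries
        for callback, filters in filters_by_callback.items():
            matched = [filter_id(f) for f in filters if matches(f, entry)]
            if matched:  # a single POST, however many filters matched
                deliveries.append((callback, entry, ", ".join(matched)))
    return deliveries


# Hypothetical data for illustration.
entries = [
    "Storm alert for New York City",
    "Air pollution alert for New York City",
    "Sunny in Boston",
]
filters_by_callback = {
    "https://subscriber.example/callback": ["storm alert", "new york"],
}
deliveries = plan_deliveries(entries, filters_by_callback)
```

Here the first entry matches both filters and still yields a single delivery whose header lists two ids; the third entry matches nothing and is not delivered at all.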
> Why use an MD5 hash for the FilterId?
>
> MD5 is commonly available on all platforms and in all useful languages and
> it is easy and cheap to compute.
> If users are free to provide their own FilterIds, they will inevitably abuse
> the system and use very long ids that tax the system.
> MD5 hashes are of known sizes and properties and are system independent.
> Thus, using them makes it easier to design, implement and deploy across
> systems, develop open-source implementations, etc.
>
> Why not have the subscriber pass the FilterId to the hub?
>
> If the FilterId is a simple MD5 hash, it is not necessary for the subscriber
> to pass it in.
> If the subscriber can pass in the FilterId, then they will inevitably abuse
> this -- as discussed above.
>
> Why return the FilterId as a header on the Subscription Response?
>
> To ensure that both Server and Subscriber agree on the FilterId.
> To accommodate those coders who can't figure out how to compute MD5
> hashes... (i.e. don't compute it, just pull it out of the response.)
>
> Why use hub.filterid for unsubscribing instead of hub.filter?
>
> In some cases, the subscriber will have "forgotten" the mapping from filter
> to FilterId but may still be getting results for a previous subscription. In
> order to unsubscribe, they need to provide the FilterId to unsubscribe and
> thus stop the unexpected messages.
>
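To make the unsubscribe-by-id case concrete, here is a minimal in-memory sketch of how a hub might key its filter subscriptions. The FilterRegistry class and its method names are hypothetical, and the MD5-over-UTF-8 id is the same assumption as in the proposal's MD5 suggestion; a real hub would also perform the verification step, which is omitted here.

```python
import hashlib


class FilterRegistry:
    """In-memory stand-in for a hub's filter subscription store (hypothetical)."""

    def __init__(self):
        # (callback, topic, filter_id) -> filter text
        self._filters = {}

    def subscribe(self, callback: str, topic: str, filter_text: str) -> str:
        fid = hashlib.md5(filter_text.encode("utf-8")).hexdigest()
        self._filters[(callback, topic, fid)] = filter_text
        return fid  # the value that would be returned in X-Hub-FilterId

    def unsubscribe(self, callback: str, topic: str, fid: str) -> bool:
        # Only the id is needed; the subscriber may have forgotten the filter text.
        return self._filters.pop((callback, topic, fid), None) is not None


# Hypothetical usage: the id could equally be copied from a delivered POST.
registry = FilterRegistry()
callback = "https://subscriber.example/callback"
topic = "https://weather.example/alerts.atom"
fid = registry.subscribe(callback, topic, "storm alert")
removed = registry.unsubscribe(callback, topic, fid)
removed_again = registry.unsubscribe(callback, topic, fid)
```

A subscriber that has lost its own filter-to-id mapping can take the id from the X-Hub-FilterId field of an unwanted delivery and pass it as hub.filterid.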
> Why not specify the filter syntax now?
>
> I believe the filter syntax discussion is orthogonal to the discussion of
> the framework for registering subscriptions and for getting responses.
> Ideally, folk would use some reasonable subset of XPath filtering. However,
> that might be overwhelming for some.
>
> What about firehose feeds?
>
> I'm suggesting that this extension apply to all feeds, not just to the
> still-undefined firehose feeds, even though content-based filtering will
> probably be most useful when, in fact, used with firehose feeds. If we
> assume that a firehose feed is identified by a topic (perhaps a
> ".well-known" topic...) then what we defined for normal topics should also
> apply to firehose feeds.
>
> I would appreciate any comments that you might have. I think that the
> proposal here is about as simple as it can get without introducing serious
> bandwidth or processing loads on servers and/or subscribers.
> bob wyman
>
