On Tue, Jun 29, 2010 at 11:09 PM, Bob Wyman <[email protected]> wrote:
> Now that we've got substantial experience with topic-based PubSubHubbub,

With all due respect - I don't think we have enough experience yet. The
spec is still unstable.

alexis

> I think it is time we began considering how we would extend the current
> protocol to support content-based filtering of topics... We've casually
> discussed this in the past; I suggest that we address the subject seriously.
> In a topic-based subscription system, a subscriber receives all messages
> published to any topic to which they have subscribed.
> With content-based filtering, the subscriber is able to associate one or
> more filters with a topic and will then only receive those messages which
> are published to the topic and which match at least one of the associated
> filters. Thus, given a topic based on a feed of weather-related
> announcements, I might create a subscription with a filter that indicates
> that I'm only interested in messages that are "Storm Alerts" and that are
> intended for the New York City area. I might also subscribe to "Air
> Pollution Alerts" in the same area. Given such subscriptions, I would only
> receive those messages published to the topic that matched one of my filters
> even if there were many orders of magnitude more messages actually published
> to the topic.
> I would like to suggest the following extensions to support content-based
> filtering:
>
> In "6.1 Subscriber Sends Subscription Request" Add an optional "hub.filter"
> which would contain a textual filter expression to be specified in mode
> "subscribe" (syntax to be discussed later).
> In "6.1 Subscriber Sends Subscription Request" Add an optional
> "hub.filterid" which would be used in mode "unsubscribe".
> In "6.1.2 Subscription Response Details": The hub's response should include
> the header field X-Hub-FilterId which would contain a unique id for the
> filter. I believe we should require that this be an MD5 hash of the query.
> In "6.2 Hub Verifies Intent of the Subscriber" Include hub.filter in the
> request and provide X-Hub-FilterId in the header.
> In "7.3 Content Distribution" For each subscriber/callback-url that had
> registered a filter on a topic, a single POST would be generated -- as long
> as one or more of those filters matched published content. In each POST
> request from hub to subscriber's callback URL, there would be an
> X-Hub-FilterId field that lists all of the filter-ids for filters that match
> the published data.
> Also, when processing feeds which contain more than one entry or item, the
> feeds should be "broken up" so that content-based matching and thus
> distribution is done against only one entry or item at a time.
> In "2. Definitions" Add definitions for "filter" and "filter-id".
>
> Q&A: (In which I try to anticipate the obvious questions...)
>
> Why not put the filter query in the Topic URL and eliminate the hub.filter?
>
> Any attempt to use the Topic URL would put constraints on those URLs.
> We already have PSHB subscriptions to URLs which contain query parameters
> and thus it would be hard to distinguish PSHB filter parameters from query
> parameters.
> Using a distinct URL for each subscription would result in a distinct
> delivery for each matching message. Thus, if you had thousands of distinct
> filters on a single topic, you would receive thousands of deliveries.
> Embedding the filter in the query will inevitably lead to it being
> transmitted over the network repeatedly as well as being logged in various
> log files. However, filter queries often contain very sensitive information
> whose disclosure could compromise either privacy or security. Not including
> it in the URL somewhat reduces the opportunity for inadvertently disclosing
> or logging filter queries.
>
> Why use FilterId instead of just using the actual query?
>
> In many cases, users' filters will expose data that is private or sensitive.
> If filters are frequently transmitted between machines, they are more likely
> to be intercepted by malicious or unintended intermediaries. Privacy and
> security are enhanced by only passing the actual filter text when it is
> absolutely necessary.
> Using FilterId for delivery and for unsubscribe means that the actual query
> is never exposed except during the initial subscription exchange. Thus, it
> is possible to implement a "secure" system for creating subscriptions and a
> somewhat less secure system for distributing the results.
> While queries will initially probably be very simple (simple keywords), we
> can expect that queries will become more complex over time. Thus, the length
> of a query can become very large. Using a FilterId makes handling these
> things much simpler.
> While using the query would simplify debugging, the reality is that running
> systems will be more easily coded if they can rely on the ids of matched
> filters being of fixed size and numeric -- to facilitate use as database
> keys, etc.
>
> Why list all FilterIds with the delivered data?
>
> Doing so makes it possible to deliver a single response that covers a
> potentially large number of filters. A single feed update might, for a
> shared subscriber, match tens, hundreds, or even thousands of filters. In
> the case of many filter matches, delivering a unique response for each
> filter would waste vast amounts of network and processing resources.
> Note: Subscribers who want a distinct response for each matched filter would
> just subscribe each filter with a distinct callback URL.
>
> Why use an MD5 hash for the FilterId?
>
> MD5 is commonly available on all platforms and in all useful languages and
> it is easy and cheap to compute.
> If users are free to provide their own FilterIds, they will inevitably abuse
> the system and use very long ids that tax the system.
> MD5 hashes are of known sizes and properties and are system independent.
> Thus, using them makes it easier to design, implement and deploy across
> systems, develop open-source implementations, etc.
>
> Why not have the subscriber pass the FilterId to the hub?
>
> If the FilterId is a simple MD5 hash, it is not necessary for the subscriber
> to pass it in.
> If the subscriber can pass in the FilterId, then they will inevitably abuse
> this -- as discussed above.
>
> Why return the FilterId as a header on the Subscription Response?
>
> To ensure that both Server and Subscriber agree on the FilterId.
> To accommodate those coders who can't figure out how to compute MD5
> hashes... (i.e. don't compute it, just pull it out of the response.)
>
> Why use hub.filterid for unsubscribing instead of hub.filter?
>
> In some cases, the subscriber will have "forgotten" the mapping from filter
> to FilterId but may still be getting results for a previous subscription. In
> order to unsubscribe, they need to provide the FilterId to unsubscribe and
> thus stop the unexpected messages.
>
> Why not specify the filter syntax now?
>
> I believe the filter syntax discussion is orthogonal to the discussion of
> the framework for registering subscriptions and for getting responses.
> Ideally, folk would use some reasonable subset of XPath filtering. However,
> that might be overwhelming for some.
>
> What about firehose feeds?
>
> I'm suggesting that this extension apply to all feeds, not just to the
> still-undefined firehose feeds even though content-based filtering will
> probably be most useful when, in fact, used with firehose feeds. If we
> assume that a firehose feed is identified by a topic (perhaps a
> ".well-known" topic...) then what we define for normal topics should also
> apply to firehose feeds.
>
> I would appreciate any comments that you might have. I think that the
> proposal here is about as simple as it can get without introducing serious
> bandwidth or processing loads on servers and/or subscribers.
> bob wyman
>
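
To make the proposed flow concrete, here is a minimal Python sketch of how
the pieces quoted above might fit together: deriving the FilterId as an MD5
hash of the filter text, building the subscribe and unsubscribe request
bodies, and working out the X-Hub-FilterId value for a single entry. Several
things here are assumptions and not part of the proposal: the filter syntax
is undefined, so a filter is treated as a plain case-insensitive keyword
phrase; the hash is taken over the UTF-8 bytes and written as lowercase hex;
multiple ids are joined with ", " in the header; and all function names and
example URLs are hypothetical.

```python
import hashlib
from typing import Dict, List, Optional
from urllib.parse import urlencode


def filter_id(filter_text: str) -> str:
    """FilterId as an MD5 hash of the filter text.

    Assumption: hash the UTF-8 bytes and use the lowercase hex digest.
    """
    return hashlib.md5(filter_text.encode("utf-8")).hexdigest()


def subscribe_body(callback: str, topic: str, filter_text: str) -> str:
    """Form-encoded body for a subscribe request carrying hub.filter."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.callback": callback,
        "hub.topic": topic,
        "hub.verify": "sync",
        "hub.filter": filter_text,  # proposed optional parameter
    })


def unsubscribe_body(callback: str, topic: str, fid: str) -> str:
    """Form-encoded body for an unsubscribe request carrying hub.filterid."""
    return urlencode({
        "hub.mode": "unsubscribe",
        "hub.callback": callback,
        "hub.topic": topic,
        "hub.verify": "sync",
        "hub.filterid": fid,  # proposed optional parameter
    })


def matching_filter_ids(entry_text: str, filters: Dict[str, str]) -> List[str]:
    """Return the ids of all filters that match one entry.

    `filters` maps FilterId -> filter text for one callback URL.
    Assumption: a filter matches when its text occurs in the entry,
    ignoring case. The real filter syntax is still to be discussed.
    """
    lowered = entry_text.lower()
    return sorted(fid for fid, text in filters.items()
                  if text.lower() in lowered)


def filter_id_header(entry_text: str, filters: Dict[str, str]) -> Optional[str]:
    """Value for X-Hub-FilterId on a delivery, or None if nothing matched.

    One POST per callback lists every matching id (assumed ", " separator);
    no POST is sent when no filter matches.
    """
    ids = matching_filter_ids(entry_text, filters)
    return ", ".join(ids) if ids else None
```

With filters registered for "storm alert" and "air pollution alert", an
entry titled "Storm Alert for New York City" would yield a header value
containing only the id of the first filter, and an entry matching neither
would produce no delivery at all. A feed with several entries would be
split first and each entry passed through filter_id_header separately.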
