Hey all,

Had some ideas I've been kicking around in discussions with various folks. Would love some early feedback.

== Background

The plan right now is to ditch the "Aggregated Content Distribution" section of the spec (see http://code.google.com/p/pubsubhubbub/issues/detail?id=105). It has a variety of issues and has never been deployed. However, I believe there is still a need for efficient aggregated delivery that follows from Bob Wyman's ideas about content filtering (http://groups.google.com/group/pubsubhubbub/msg/820f7f29b7c22d46).

Take the Google Buzz Track API for example (http://code.google.com/apis/buzz/v1/using_rest.html#activity-track). Let's say you have these two Track subscriptions registered (both PubSubHubbub topics):

  https://www.googleapis.com/buzz/v1/activities/track?q=Bilbo
  https://www.googleapis.com/buzz/v1/activities/track?q=Baggins

An item comes through that matches both terms (a post with the author "Bilbo Baggins"). Your PuSH subscriber will receive *two* copies of that message, one for each subscription, each on a different callback URL that was registered when you set up the PuSH subscription. This gets much worse as the number of Track queries and potential overlaps increases; it's *especially* awful for geographic queries, which intrinsically overlap.

Bob's solution is to deliver a single copy of the "Bilbo Baggins" post but annotate it with *which queries* it matched. I like this idea, but I want to:

1) change how we express the annotation,
2) make it easy for existing clients to migrate to the new scheme,
3) not add any new parameters (e.g., "hub.filter") to the PuSH protocol.

== The Proposal

PubSubHubbub-enabled feeds will declare a new aggregation relation ("http://pubsubhubbub.org/aggregation"). The "href" is picked by the publisher and is a statement of "things with this aggregation URL I can batch together into aggregated delivery."
For example, with the Buzz Track API feeds we could do:

  <feed>
    <link rel="self" href="https://www.googleapis.com/buzz/v1/activities/track?q=Bilbo"/>
    <link rel="http://pubsubhubbub.org/aggregation" href="https://www.googleapis.com/buzz/v1/activities/combined"/>
    ...
  </feed>

Subscribers would see this new "rel" link and know that they could subscribe to that new topic ("https://www.googleapis.com/buzz/v1/activities/combined") to get aggregated delivery.

What does it mean to get aggregated delivery? Essentially, *all* of the subscriber's existing subscriptions with that same "aggregation" link value would *STOP* delivering, and instead the subscriber would get POSTs on a *single* callback that look like this:

  POST /my-aggregated-callback HTTP/1.1
  Link: <https://www.googleapis.com/buzz/v1/activities/combined>; rel="http://pubsubhubbub.org/aggregation",
    <https://www.googleapis.com/buzz/v1/activities/track?q=Bilbo>; rel="self",
    <https://www.googleapis.com/buzz/v1/activities/track?q=Baggins>; rel="self"
  X-Hub-Signature: ...

  <feed>
    <link rel="self" href="https://www.googleapis.com/buzz/v1/activities/combined"/>
    <link rel="http://pubsubhubbub.org/aggregation" href="https://www.googleapis.com/buzz/v1/activities/combined"/>
    ...
  </feed>

Thus you will only get one copy of each item. The list of queries matched will be in the Link header so users know why they're getting the item.

This proposal would fundamentally decouple subscription verification from event delivery. One non-obvious consequence: if the subscriber adds a new PuSH subscription with the same "aggregation" link value, the hub will use that subscription's normal callback URL for PuSH verification but send all content delivery to the aggregated callback. Unsubscription verification will likewise use a callback URL separate from the aggregated one. If the subscriber unsubscribes from the aggregation URL, then all of the subscriptions will revert to normal per-subscription delivery.
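
To make the subscriber side concrete, here's a rough sketch (in Python, and not part of the proposal itself) of how a callback handler might pull the matched queries out of the Link header shown above. The function name parse_aggregated_links is made up for illustration, and the regex only handles the simple <url>; rel="..." shape from the example; it is not a complete Link header parser.

```python
import re

# Relation name taken from the proposal text.
AGGREGATION_REL = "http://pubsubhubbub.org/aggregation"

# Matches one `<url>; rel="value"` entry. Simplifying assumption: every
# entry has a quoted rel as its first parameter, as in the example POST.
_LINK_RE = re.compile(r'<([^>]*)>\s*;\s*rel="([^"]*)"')


def parse_aggregated_links(link_header):
    """Return (aggregation_url, matched_topics) from a Link header value.

    aggregation_url is None if no aggregation link is present;
    matched_topics lists the rel="self" URLs in header order.
    """
    aggregation_url = None
    matched_topics = []
    for url, rel in _LINK_RE.findall(link_header):
        if rel == AGGREGATION_REL:
            aggregation_url = url
        elif rel == "self":
            matched_topics.append(url)
    return aggregation_url, matched_topics


# Link header value from the example aggregated POST.
header = (
    '<https://www.googleapis.com/buzz/v1/activities/combined>; '
    'rel="http://pubsubhubbub.org/aggregation", '
    '<https://www.googleapis.com/buzz/v1/activities/track?q=Bilbo>; rel="self", '
    '<https://www.googleapis.com/buzz/v1/activities/track?q=Baggins>; rel="self"'
)

aggregation_url, topics = parse_aggregated_links(header)
print(aggregation_url)
for topic in topics:
    print("matched:", topic)
```

A subscriber would presumably use the returned topic list to route the single delivered item to whatever internal handling it had for each Track query.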

== Open questions

Random list of questions:

* What granularity do you use to move the existing subscriptions to the aggregated endpoint? Does the publisher do it by domain, by URL prefix, by some other token?

* Should the "self" links in the aggregated delivery be for the feeds you subscribed to, or should you instead pass through the callback URLs that *would* have been used for normal delivery? The latter approach could be useful for subscribers who put context data into their callback URLs.

* Will this allow us to finally put the Topic header in content delivery as users have requested a million times? (http://code.google.com/p/pubsubhubbub/issues/detail?id=79)

* Can this scheme be reused for aggregated delivery across different sites, so subscribers get fewer POSTs?

Thanks for reading!

-Brett
