Hey all,

Had some ideas I've been kicking around in discussions with various
folks. Would love some early feedback.


== Background

The plan right now is to ditch the "Aggregated Content Distribution"
section of the spec (see
http://code.google.com/p/pubsubhubbub/issues/detail?id=105). It has
a variety of issues and has never been deployed. However, I
believe there is still a need for efficient aggregated delivery that
follows from Bob Wyman's ideas about content filtering
(http://groups.google.com/group/pubsubhubbub/msg/820f7f29b7c22d46).

Take the Google Buzz Track API for example
(http://code.google.com/apis/buzz/v1/using_rest.html#activity-track).
Let's say you have these two Track subscriptions registered (both
PubSubHubbub topics):

https://www.googleapis.com/buzz/v1/activities/track?q=Bilbo
https://www.googleapis.com/buzz/v1/activities/track?q=Baggins

An item comes through that matches both terms (a post with the author
"Bilbo Baggins"). Your PuSH subscriber will receive *two* copies of
that message, one for each subscription, each on a different callback
URL that was registered when you set up the PuSH subscription. This
gets much worse as the number of Track queries and potential overlaps
increases; it's *especially* awful for geographic queries, which
intrinsically overlap.

Bob's solution is to deliver a single copy of the "Bilbo Baggins" post
but annotate it with *which queries* it matched. I like this idea, but
I want to 1) change how we express the annotation, 2) make it easy for
existing clients to migrate to the new scheme, 3) not add any new
parameters (e.g., "hub.filter") to the PuSH protocol.


== The Proposal

PubSubHubbub-enabled feeds will declare a new aggregation relation
("http://pubsubhubbub.org/aggregation"). The "href" is picked by the
publisher and is a statement of "things with this aggregation URL I
can batch together into aggregated delivery." For example, with the
Buzz Track API feeds we could do:

<feed>
  <link rel="self"
href="https://www.googleapis.com/buzz/v1/activities/track?q=Bilbo"/>
  <link rel="http://pubsubhubbub.org/aggregation"
href="https://www.googleapis.com/buzz/v1/activities/combined"/>
  ...
</feed>
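
To make the discovery step concrete, here is a rough Python sketch of how a subscriber might pull the aggregation link out of a feed. The function name find_aggregation_url and the sample feed are made up for illustration; the only thing taken from the proposal is the "rel" value. It assumes the link sits directly under the feed element, and it accepts both Atom-namespaced and bare link elements.

```python
import xml.etree.ElementTree as ET

# The rel value comes from the proposal; everything else here is hypothetical.
AGGREGATION_REL = "http://pubsubhubbub.org/aggregation"
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def find_aggregation_url(feed_xml):
    """Return the href of the first aggregation link in the feed, or None."""
    root = ET.fromstring(feed_xml)
    # Look at direct children only: namespaced Atom links first, then bare ones.
    links = root.findall(ATOM_NS + "link") + root.findall("link")
    for link in links:
        if link.get("rel") == AGGREGATION_REL:
            return link.get("href")
    return None


sample_feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="self"
        href="https://www.googleapis.com/buzz/v1/activities/track?q=Bilbo"/>
  <link rel="http://pubsubhubbub.org/aggregation"
        href="https://www.googleapis.com/buzz/v1/activities/combined"/>
</feed>"""

print(find_aggregation_url(sample_feed))
```

A feed without the link returns None, so existing subscribers that never look for it are unaffected.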

Subscribers would see this new "rel" link and know that they could
subscribe to that new topic
("https://www.googleapis.com/buzz/v1/activities/combined") to get
aggregated delivery. What does it mean to get aggregated delivery?
Essentially, *all* of the subscriber's existing subscriptions with
that same "aggregation" link value would *STOP* delivering, and
instead the subscriber would get POSTs on a *single* callback that
look like this:

POST /my-aggregated-callback HTTP/1.1
Link: <https://www.googleapis.com/buzz/v1/activities/combined>;
        rel="http://pubsubhubbub.org/aggregation",
        <https://www.googleapis.com/buzz/v1/activities/track?q=Bilbo>;
        rel="self",
        <https://www.googleapis.com/buzz/v1/activities/track?q=Baggins>;
        rel="self"
X-Hub-Signature: ...

<feed>
  <link rel="self"
href="https://www.googleapis.com/buzz/v1/activities/combined"/>
  <link rel="http://pubsubhubbub.org/aggregation"
href="https://www.googleapis.com/buzz/v1/activities/combined"/>
  ...
</feed>

Thus you will only get one copy of each item. The list of queries
matched will be in the Link header so users know why they're getting
the item.
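
On the subscriber side, recovering the matched queries from an aggregated delivery could look something like the Python sketch below. It is deliberately simplified: it assumes each link is written as <url>; rel="..." with a quoted rel as the first parameter, which is what the example above uses, and it is not a full Link header parser. The helper names are hypothetical.

```python
import re

# Matches one <url>; rel="value" pair. Simplified on purpose: other link
# parameters and unquoted rel values are not handled.
LINK_RE = re.compile(r'<([^>]*)>\s*;\s*rel="([^"]*)"')


def parse_link_header(value):
    """Return a list of (url, rel) pairs found in a Link header value."""
    return LINK_RE.findall(value)


def matched_topics(value):
    """Return the topic URLs (rel="self") that the delivered item matched."""
    return [url for url, rel in parse_link_header(value) if rel == "self"]


link_header = (
    '<https://www.googleapis.com/buzz/v1/activities/combined>; '
    'rel="http://pubsubhubbub.org/aggregation", '
    '<https://www.googleapis.com/buzz/v1/activities/track?q=Bilbo>; '
    'rel="self", '
    '<https://www.googleapis.com/buzz/v1/activities/track?q=Baggins>; '
    'rel="self"'
)

for topic in matched_topics(link_header):
    print(topic)
```

The subscriber can then fan the single item out internally to whatever code cares about each matched topic.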

This proposal would fundamentally decouple subscription verification
from event delivery. If the subscriber adds a new PuSH subscription
with the same "aggregation" link value, the hub will (non-obviously)
use the normal callback URL for PuSH verification but send all content
delivery to the aggregated callback. Unsubscription will also use a
separate callback URL for verification. If the subscriber unsubscribes
from the aggregation URL, then all of the subscriptions will revert
to normal delivery on their own callback URLs.
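
To illustrate the routing behavior described above, here is a toy in-memory Python sketch of how a hub might pick delivery targets. Everything in it (the HubRouter class, its method names, the data layout) is an assumption for discussion, not part of any existing hub implementation, and it skips verification entirely.

```python
class HubRouter:
    """Toy model of per-subscriber delivery routing (hypothetical)."""

    def __init__(self):
        # subscriber -> topic -> (normal callback, aggregation URL or None)
        self.subscriptions = {}
        # subscriber -> aggregation URL -> aggregated callback
        self.aggregated = {}

    def subscribe(self, subscriber, topic, callback, aggregation_url=None):
        self.subscriptions.setdefault(subscriber, {})[topic] = (
            callback, aggregation_url)

    def subscribe_aggregated(self, subscriber, aggregation_url, callback):
        self.aggregated.setdefault(subscriber, {})[aggregation_url] = callback

    def unsubscribe_aggregated(self, subscriber, aggregation_url):
        self.aggregated.get(subscriber, {}).pop(aggregation_url, None)

    def deliveries(self, subscriber, matched_topics):
        """Return {callback: [topics]} for one item matching these topics."""
        targets = {}
        subs = self.subscriptions.get(subscriber, {})
        agg = self.aggregated.get(subscriber, {})
        for topic in matched_topics:
            if topic not in subs:
                continue
            callback, aggregation_url = subs[topic]
            # Use the aggregated callback if one is registered for this
            # aggregation URL; otherwise fall back to the normal callback.
            target = agg.get(aggregation_url, callback)
            targets.setdefault(target, []).append(topic)
        return targets


COMBINED = "https://www.googleapis.com/buzz/v1/activities/combined"
BILBO = "https://www.googleapis.com/buzz/v1/activities/track?q=Bilbo"
BAGGINS = "https://www.googleapis.com/buzz/v1/activities/track?q=Baggins"

router = HubRouter()
router.subscribe("sub1", BILBO, "/callback-bilbo", COMBINED)
router.subscribe("sub1", BAGGINS, "/callback-baggins", COMBINED)
print(router.deliveries("sub1", [BILBO, BAGGINS]))  # two callbacks, two POSTs

router.subscribe_aggregated("sub1", COMBINED, "/my-aggregated-callback")
print(router.deliveries("sub1", [BILBO, BAGGINS]))  # one callback, one POST
```

Each key in the returned dict corresponds to one POST, and the topic list under it is what would go into the "self" entries of the Link header.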


== Open questions

Random list of questions:

* What granularity do you use to move the existing subscriptions to
the aggregated endpoint? Does the publisher do it by domain, by URL
prefix, by some other token?
* Should the "self" links in the aggregated delivery be for the feeds
you subscribed to, or should you instead pass through the callback
URLs that *would* have been used for normal delivery? The latter
approach could be useful for subscribers who put context data into
their callback URLs.
* Will this allow us to finally put the Topic header in content
delivery as users have requested a million times?
(http://code.google.com/p/pubsubhubbub/issues/detail?id=79)
* Can this scheme be reused for aggregated delivery across different
sites, so subscribers get fewer POSTs?


Thanks for reading!

-Brett
