Re: [pubsubhubbub] Options for firehoses and filtering

Bob Wyman Mon, 15 Mar 2010 13:24:22 -0700

I think it would be useful in this discussion to distinguish between "pure
firehose" feeds (which would send everything that a hub receives) and
"tracking" feeds that produce a filtered version of everything that a hub
receives. (i.e. content-based filtering based on things like geo-codes or
words/phrases that appear in published content.)


The reality is that there won't be many consumers of pure firehose feeds --
at least not for the major hubs, if only because they can publish at
exceptionally high volumes and will tax or even overwhelm the resources of
most potential consumers. (How many people have servers that can consume
XX,XXX messages per second?) As such, it is likely that the "fuck it"
strategy would work for pure firehose feeds. The number of publishers will
be small and the number of subscribers will be small. Thus, it is
reasonable, although unfortunate, to expect them to link up on their own and
work out private arrangements for firehose feed access. (Note: Firehose
feeds are often expensive to provide and folk shouldn't always have the
expectation that they will be able to access them for free...)

However, the content-based, filtered, "tracking" feeds are a different
story, since they will be used to retrieve a potentially tiny subset of the
full firehose feed. (e.g. "posts in New York City" or "posts
that contain the word 'foobar'" are a tiny subset of "all posts in the
world.") Because content-based feeds are going to be useful to a much larger
population of consumers and a large number of consumers are probably
technically capable of processing them, it is reasonable that we do
something in the protocol to provide for them.

Fortunately, I think there is a solution that works for both kinds of feed
and that lines up with recent additions to our understanding of URLs...
PSHB currently allows you to subscribe to feeds/topics by URL. Well, why not
create what is essentially a set of "well-known" "topics/URLs" that identify
"firehose" feeds?

So, to subscribe to a firehose feed, I would use a URL that had something
like ".well-known/firehose" as the value of hub.topic. In order to get a
filtered feed, I would pass an additional parameter, "hub.query".

To subscribe to a firehose feed, I would use:
  hub.topic = ".well-known/firehose"
To get all entries that contain the word "foobar" in the atom:title, I would
do the following:
   hub.topic = ".well-known/firehose"
   hub.query = "title:foobar"
To get all posts which originate at "example.com/feed.atom" and that contain
"foo bar" in either the title or summary, I would say:
   hub.topic = "example.com/feed.atom"
   hub.query = "summary:'foo bar' OR title:'foo bar'"
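
A minimal sketch of how a subscriber might build the first two requests
above. It assumes the proposed "hub.query" parameter, which is not part of
the current spec, and a hypothetical callback URL and helper name; the other
parameters are the usual subscription ones.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical subscriber endpoint, used only for illustration.
CALLBACK = "http://subscriber.example/callback"


def build_subscription(topic, query=None, callback=CALLBACK):
    """Return the form-encoded body of a subscription request.

    hub.query is the parameter proposed in this message; it is only
    sent when a filter is wanted.
    """
    params = {
        "hub.mode": "subscribe",
        "hub.callback": callback,
        "hub.topic": topic,
        "hub.verify": "sync",
    }
    if query is not None:
        params["hub.query"] = query
    return urlencode(params)


# Pure firehose: no filter at all.
firehose = build_subscription(".well-known/firehose")

# Tracking feed: entries with "foobar" in the title.
tracking = build_subscription(".well-known/firehose", "title:foobar")
```

The only difference between the two request bodies is the presence of
hub.query, so a hub that does not support filtering could simply reject
requests that carry it.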

All other parts of the protocol would stay the same -- with one exception --
and that is that we'd need the hub to be able to respond to a subscription
request with a message that says: "Please contact [email protected] to arrange
for access."

bob wyman

On Sun, Mar 14, 2010 at 3:17 PM, Brett Slatkin <[email protected]> wrote:

> Hey all,
>
> Here's some rough notes that Julien and I came up with at SxSW this
> year to talk about the options for using virtual feeds (eg, firehoses,
> filtering, track, geo boundaries) with PubSubHubbub. We got some nice
> input from bradfitz, Eric Marcoullier (from Gnip), Ilya Grigorik (from
> postrank), and of course, Mr. Filtering himself, Bob Wyman.
>
> Please note that order in this doc is not significant at all, we just
> wanted to get the options out there. If you have any additional
> variants of these specific options or a whole new option let us know.
>
> Thanks in advance for your feedback!
>
> -Brett
>
> ---------------
>
> 1. Use XRD
>
> - [email protected] has a feed
> - Could also work on an arbitrary URI for a domain
> - Could also work on the Hub URL
>
> Do some WebFinger: find example.com/.well-known/host-meta
>
> contains:
> <link rel="http://pubsubhubbub.org/full-feed"
> href="http://buzz.google.com/full-feed"/>
>
> This full feed URL could be a link to subscribe to or it could be an
> HTML page that says how to get approval for the firehose. You could
> have a click-through ToS to accept some terms, generate a one-off
> firehose URL, charge money, whatever you want.
>
> Good things
> - No change to hubbub protocol
>
> Bad
> - Have to fetch/parse XRD for discovery
> - Works per feed rather than per hub if the discovery is not on the hub url
> (so custom domains would require firehose discovery every time; would
> also like for one domain to have multiple different hubs for
> syndication)
>
>
> 2. Link relation in the feed itself
>
> Put something like:
>
> <atom:link rel="supersauce" href="http://buzz.google.com/full-feed"/>
>
> In every feed produced by a publisher.
>
> Good:
> - No new discovery document
> - Exactly the same discovery flow except different link relation
>
> Bad:
> - Have to add this link relation to every feed doc
> - New features for additional relation types require publisher to
> change their feed yet again (so hub functionality is too tightly
> coupled with the publisher's feed, as opposed to delegation to the hub
> for discovering what the hub can do on behalf of the feed)
>
>
> 3. Verification request includes discovery information
>
> You find a feed, it has some hub urls, you subscribe and then you see
> on the verification request something like:
>
> hub.extension.fullfeed=http://example.com/full-feed
>
> And then you know that you could go back to the hub and subscribe to
> the full firehose.
>
> Could also use URI templating in here for doing specific kinds of
> filtering (using the templating spec at
> http://bitworking.org/projects/URI-Templates/spec/draft-gregorio-uritemplate-03.html)
>
> hub.extension.filter=http://example.com/filter?params={{params}}&box={{lat/lon,lat/lon}}
>
> Another variant is these extra params could be in the headers of a
> notification request.
>
> Good:
> - Decouples hub functionality from feed publisher so hub can add new
> features without publisher changes
> - No extra queries or polling to find the extra features of the hub
>
> Bad:
> - Mixing verification and feature discovery is kinda weird (subscriber
> would presumably unsubscribe from the same feed once they found the
> firehose and that's kinda weird)
> - Not clear at all how this would work with authorization of the subscriber
> - Unclear if this should be part of the base spec or if we should wait
>
>
> 4. Fuck it
>
> Don't define it. Everyone does virtual feeds/filtering/firehose
> declaration a little differently and users just figure out how to use
> their favorite provider.
>
> Pros:
> - Simplify the spec by taking out aggregated delivery (which is kind
> of broken in the base spec right now anyways because we're overriding
> what atom:source is actually for)
>
> Cons:
> - Different providers may completely diverge
>
>
> 5. Like #1 except skip XRD and use a new mode
>
> Do a query on the hub URL like:
>
> http://example.com/hub?hub.mode=whatsup
>
> This returns a 302 or an HTML doc or something that some human needs
> to inspect to figure out what they can do with this hub, some of which
> may be programmatic.
>
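
Option 3 in the quoted notes could be sketched roughly as follows: the
subscriber looks at the query string of the verification request and picks
out any "hub.extension.*" parameters. The callback URL, the challenge value,
and the helper name are assumptions for illustration; only
hub.extension.fullfeed comes from the notes.

```python
from urllib.parse import urlsplit, parse_qs

PREFIX = "hub.extension."


def extension_params(verification_url):
    """Collect hub.extension.* parameters from a verification request URL.

    Returns a dict keyed by the extension name with the prefix removed.
    """
    query = parse_qs(urlsplit(verification_url).query)
    return {
        key[len(PREFIX):]: values[0]
        for key, values in query.items()
        if key.startswith(PREFIX)
    }


# Hypothetical verification request as it might arrive at the callback.
request_url = (
    "http://subscriber.example/callback"
    "?hub.mode=subscribe"
    "&hub.topic=http%3A%2F%2Fexample.com%2Ffeed.atom"
    "&hub.challenge=abc123"
    "&hub.extension.fullfeed=http%3A%2F%2Fexample.com%2Ffull-feed"
)

extensions = extension_params(request_url)
```

A subscriber that does not understand an extension can ignore it, which is
what keeps this option decoupled from the publisher's feed.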
