On Mon, Oct 19, 2009 at 4:23 PM, igrigorik <[email protected]> wrote:
> the firehose use case
+1 There are a variety of application patterns that require a firehose.
Applications that take action based on the content of entries almost
universally require a firehose or some means to approximate one. For
instance, many content-based search applications (either retrospective or
prospective) will wish to see as much as possible of the content that passes
through a hub. If the hub doesn't provide a firehose, then these
applications will end up burdened with the need to scour the web to find
feeds, figure out which hub is serving them, and then register massive
numbers of subscriptions intended to construct a rough approximation of a
firehose. These large subscription lists are likely to be a burden not only
on the subscribers but also on the hubs.
I think we'll discover that for "large" hubs, subscribers will either be
interested in a small percentage of the topics served by the hub or they
will be interested in virtually all of them. For "large" hubs, there won't
be too many applications that have a need for "middle" percentages of
topics. (You'll want < XX% of the topics or you'll want 100% of them.) So,
while PSHB today provides for the "small number of subscriptions" use case,
we really do need to address the other kind of subscription -- the
subscription to everything.
> - Spam. Just as with the ping servers, how do
> we protect the hub from becoming a spam
> distributor? Thoughts?
I think we really should treat "spam" and "firehose" as orthogonal issues.
Certainly, spam is a problem that will be exacerbated by the provision of a
firehose. However, whether or not the hub does something about spam should
not determine whether or not the hub supports firehose subscriptions.
In general, I think we should assume that spam control is up to the
subscriber. For non-firehose subscriptions, subscribers are free to
subscribe to spam if they wish. It is, of course, expected that most
subscribers will drop subscriptions to any topic that turns out to be
spammy. This is great since it allows the hub to ignore the issue of spam
and rely on subscribers to filter things. It is probably best, and within
the spirit of keeping PSHB hubs as simple as possible, to keep the hubs free
of spam detection. Those who create firehose
subscriptions should be prepared to receive large volumes of data and should
also be prepared to do the necessary spam identification and filtering.
There is one option that would allow the hub and firehose subscribers to
share the burden of handling spam. (Note: I *do not* think we should pursue
this option.) We could allow firehose subscribers to create filters that
would be lists of feeds that they didn't want to receive. Thus, the
"firehose" would really be the "firehose minus the exceptions." While this
option has some charm to it, I think the reality is that it is probably more
complexity than is useful. It also puts quite a burden on the hub to
maintain the list of exceptions. That would not be a good thing.
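To make the cost of that option concrete, here is a rough Python sketch of what "firehose minus the exceptions" would ask of a hub. The class and method names are made up for illustration and are not part of PSHB. Note that the hub has to store an exception set per subscriber and consult every one of them on every delivery:

```python
class FirehoseWithExceptions:
    """Hypothetical hub-side bookkeeping for 'firehose minus the exceptions'."""

    def __init__(self):
        # Maps each subscriber callback URL to the feeds it does not want.
        self.excluded = {}

    def subscribe(self, callback, excluded_feeds=()):
        # The hub must keep (and keep current) one exception list per subscriber.
        self.excluded[callback] = set(excluded_feeds)

    def recipients(self, feed_url):
        # Every published update requires a check against every exception list.
        return [cb for cb, skip in self.excluded.items() if feed_url not in skip]
```

Leaving the filtering to the subscriber removes all of this state from the hub.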
> - Would have to define the subscription mechanics / protocol.
It seems to me that the simplest thing to do here is to reuse all of the
existing subscription framework and just define a "special" topic URL that
indicates a firehose subscription. Something like the following might work:
http://pubsubhubbub.googlecode.com/firehose
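As a rough sketch of how little would change for a subscriber, the Python below builds the form-encoded body of an ordinary PSHB subscription request and simply puts the special URL in hub.topic. The firehose URL is only the proposal above, the callback URL is a placeholder, and the helper name is made up for illustration:

```python
from urllib.parse import urlencode, parse_qs

# Proposed in this thread only; not defined by the PSHB spec.
FIREHOSE_TOPIC = "http://pubsubhubbub.googlecode.com/firehose"

def build_subscribe_body(callback_url, topic_url=FIREHOSE_TOPIC, verify="async"):
    """Return the form-encoded body of a PSHB subscription request."""
    params = {
        "hub.mode": "subscribe",
        "hub.callback": callback_url,  # where the hub will POST new content
        "hub.topic": topic_url,        # the special URL stands in for a feed URL
        "hub.verify": verify,          # "sync" or "async" verification
    }
    return urlencode(params)

# The subscriber would POST this body to the hub's endpoint as usual.
body = build_subscribe_body("http://subscriber.example.com/callback")
```

Nothing else in the request differs from a normal subscription, which is the point of reusing the existing framework.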
If we wanted to, we could decide that there will be more than one member of
the class of well-known "special URLs" and thus do something like define a
common base URL which is then extended. We might, say, use "WellKnown" to
indicate this class of special URLs. Thus, a firehose URL would be:
http://pubsubhubbub.googlecode.com/WellKnown/firehose
Using such a base URL makes it obvious how to extend the system. For
instance, we might one day want a hub to be able to publish an XRD file that
contains interesting bits of information about the hub and we might want
some way to allow people to be notified of when that XRD file is updated. In
that case, we might say that the way you get these updates, in a general
way, is to subscribe to:
http://pubsubhubbub.googlecode.com/WellKnown/XRD
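On the hub side, a common base URL makes the special topics easy to recognize. A minimal Python sketch, assuming the "WellKnown" base proposed above (the function name is hypothetical):

```python
# Base URL as proposed in this thread; not defined by the PSHB spec.
WELL_KNOWN_BASE = "http://pubsubhubbub.googlecode.com/WellKnown/"

def well_known_name(topic_url):
    """Return the special-topic name (e.g. 'firehose', 'XRD'), or None
    when the topic is an ordinary feed URL."""
    if not topic_url.startswith(WELL_KNOWN_BASE):
        return None
    name = topic_url[len(WELL_KNOWN_BASE):]
    return name or None  # the bare base URL names no special topic
```

A hub that doesn't recognize a given name could refuse the subscription like any other topic it won't serve.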
Note: Because firehose subscriptions are often very expensive to maintain,
subscribers shouldn't be surprised if their subscriptions aren't
automatically accepted. It should be perfectly reasonable for a hub to
require some out-of-band communication (such as email messages, phone
calls, etc.) before a firehose subscription is accepted. (We should,
however, avoid the kind of complexity that comes from doing things like
supporting deferred subscription acceptance such as is done with XMPP
PubSub...)
The really interesting question becomes: "What is the format of a firehose
feed?" We could just say that what you'll get is the same as if you'd made
many, many subscriptions. However, given that a firehose will be resource
intensive in most cases, I think it makes sense to stipulate that
aggregation will normally be used over potentially persistent HTTP
connections that do chunked content delivery. We should also consider adding
the <time/> and <SorryTooSlow/> elements that Brad found useful for the
SixApart Update Stream. (See:
http://www.sixapart.com/labs/update/developers/ )
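To illustrate what a subscriber on such a connection might do, here is a rough Python sketch that reads chunks incrementally and separates entries from the two control elements. The element names are the ones mentioned above. Treating <time/> as a heartbeat carrying a timestamp, treating <SorryTooSlow/> as a "you missed data" marker, and wrapping the stream in a synthetic root element are all assumptions of this sketch, not something taken from the Six Apart documentation:

```python
import xml.etree.ElementTree as ET

def read_firehose(chunks):
    """Yield (kind, payload) pairs from an iterable of text chunks."""
    parser = ET.XMLPullParser(events=("start", "end"))
    # Newer Expat versions may defer parsing; flush(), where present, avoids that.
    flush = getattr(parser, "flush", lambda: None)
    parser.feed("<stream>")  # synthetic root so the stream never has to end
    depth = 0
    for chunk in chunks:
        parser.feed(chunk)
        flush()
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
                continue
            depth -= 1
            if depth != 1:
                continue  # only report direct children of the synthetic root
            if elem.tag == "time":
                yield ("heartbeat", elem.text)
            elif elem.tag == "SorryTooSlow":
                yield ("gap", None)  # the subscriber fell behind and lost data
            else:
                yield ("entry", elem)
```

A subscriber that sees a "gap" would know that its view of the firehose is incomplete and could, for instance, fall back to polling to catch up.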
bob wyman
On Mon, Oct 19, 2009 at 4:23 PM, igrigorik <[email protected]> wrote:
>
> This was discussed briefly before [1], but would like to bring up the
> firehose use case once again within the context of PSHB.
>
> Use case: it would be great if the spec allowed subscribing to _all_
> new content seen by the hub. For example, blogger supports PSHB, but
> short of crawling for all of the blogger feeds, there is no way to get
> the full content stream.
>
> There are a few things that would have to be resolved:
> - Spam. Just as with the ping servers, how do we protect the hub from
> becoming a spam distributor? Thoughts?
> - Would have to define the subscription mechanics / protocol. Even if
> this is an optional extension to the protocol that select hubs could
> implement. (Though, of course, I'd love to see it on the already
> deployed hubs).
>
> What else?
>
> [1]
> http://groups.google.com/group/pubsubhubbub/browse_thread/thread/7f285c23f3e361ac/690e358173c83252
>
> --------
> Ilya Grigorik
> postrank.com