I think this whole approach only makes sense if there are lots of hubs, and even then I am not sure. If there are only a few hubs, then the firehose really is a feed of "everything".
If there are more than a few hubs, e.g. 'specialised' hubs (a WordPress hub, a Blogger hub), people will want to use filters if they have to pay for a lot of bandwidth. But why put all this complexity in the protocol?

On Mon, Oct 19, 2009 at 10:43 PM, Bob Wyman <[email protected]> wrote:
> On Mon, Oct 19, 2009 at 4:23 PM, igrigorik <[email protected]> wrote:
>> the firehose use case
> +1 There are a variety of application patterns that require a firehose.
> Applications that take action based on the content of entries almost
> universally require a firehose or some means to approximate one. For
> instance, many content-based search applications (either retrospective or
> prospective) will wish to see as much as possible of the content that passes
> through a hub. If the hub doesn't provide a firehose, then these
> applications will end up being burdened with the need to scour the web to
> find feeds, figure out which hub is serving them, and then register
> massive numbers of subscriptions which are intended to construct a rough
> approximation of a firehose. These large subscription lists are likely to be
> a burden not only on the subscribers but also on the hubs.
>
> I think we'll discover that for "large" hubs, subscribers will either be
> interested in a small percentage of the topics served by the hub or they
> will be interested in virtually all of them. For "large" hubs, there won't
> be too many applications that have a need for "middle" percentages of
> topics. (you'll want < XX% of the topics or you'll want 100% of them...) So,
> while PSHB today provides for the "small number of subscriptions" use case,
> we really do need to address the other kind of subscription -- the
> subscription to everything.
>
>> - Spam. Just as with the ping servers, how do
>> we protect the hub from becoming a spam
>> distributor? Thoughts?
> I think we really should treat "spam" and "firehose" as orthogonal issues.
> Certainly, spam is a problem that will be exacerbated by the provision of a
> firehose, however, whether or not the hub does something with spam should
> not determine whether or not the hub supports firehose subscriptions.
>
> In general, I think we should assume that spam control is up to the
> subscriber. For non-firehose subscriptions, subscribers are free to
> subscribe to spam if they wish. It is, of course, expected that most
> subscribers will drop subscriptions to any topic that turns out to be
> spammy. This is great since it allows the hub to ignore the issue of spam
> and rely on subscribers to filter things. It is probably best, and within
> the spirit of keeping PSHB hubs as simple as possible, to keep the hubs free
> of spam detection and thus as simple as possible. Those who create firehose
> subscriptions should be prepared to receive large volumes of data and should
> also be prepared to do the necessary spam identification and filtering.
>
> There is one option that would allow the hub and firehose subscribers to
> share the burden of handling spam. (Note: I *do not* think we should pursue
> this option.) We could allow firehose subscribers to create filters that
> would be lists of feeds that they didn't want to receive. Thus, the
> "firehose" would really be the "firehose minus the exceptions." While this
> option has some charm to it, I think the reality is that it is probably more
> complexity than is useful. It also puts quite a burden on the hub to
> maintain the list of exceptions. That would not be a good thing.
>
>> - Would have to define the subscription mechanics / protocol.
> It seems to me that the simplest thing to do here is to reuse all of the
> existing subscription framework and just define a "special" topic URL that
> indicates a firehose subscription. Something like the following might work:
> http://pubsubhubbub.googlecode.com/firehose
>
> If we wanted to, we could decide that there will be more than one member of
> the class of well-known "special URLs" and thus do something like define a
> common base URL which is then extended. We might say use "WellKnown" to
> indicate this class of special URLs. Thus, a firehose URL would be:
> http://pubsubhubbub.googlecode.com/WellKnown/firehose
> Using such a base URL makes it obvious how to extend the system. For
> instance, we might one day want a hub to be able to publish an XRD file that
> contains interesting bits of information about the hub and we might want
> some way to allow people to be notified of when that XRD file is updated. In
> that case, we might say that the way you get these updates, in a general
> way, is to subscribe to:
> http://pubsubhubbub.googlecode.com/WellKnown/XRD
>
> Note: Because firehose subscriptions are often very expensive to maintain,
> subscribers shouldn't be surprised if their subscriptions aren't
> automatically accepted. It should be perfectly reasonable for a hub to
> require that some out-of-band communication (such as email messages, phone
> calls, etc.) be necessary before a firehose subscription is accepted. (We
> should, however, avoid the kind of complexity that comes from doing things
> like supporting deferred subscription acceptance such as is done with XMPP
> PubSub...)
>
> The really interesting question becomes: "What is the format of a firehose
> feed?" We could just say that what you'll get is the same as if you'd made
> many, many subscriptions. However, given that a firehose will be resource
> intensive in most cases, I think it makes sense to stipulate that
> aggregation will normally be used over potentially persistent HTTP
> connections that do chunked content delivery. We should also consider adding
> the <time/> and <SorryTooSlow/> elements that Brad found useful for the
> SixApart Update Stream. (See:
> http://www.sixapart.com/labs/update/developers/ )
>
> bob wyman
>
> On Mon, Oct 19, 2009 at 4:23 PM, igrigorik <[email protected]> wrote:
>>
>> This was discussed briefly before [1], but would like to bring up the
>> firehose use case once again within the context of PSHB.
>>
>> Use case: it would be great if the spec allowed subscribing to _all_
>> new content seen by the hub. For example, blogger supports PSHB, but
>> short of crawling for all of the blogger feeds, there is no way to get
>> the full content stream.
>>
>> There are a few things that would have to be resolved:
>> - Spam. Just as with the ping servers, how do we protect the hub from
>> becoming a spam distributor? Thoughts?
>> - Would have to define the subscription mechanics / protocol. Even if
>> this is an optional extension to the protocol that select hubs could
>> implement. (Though, of course, I'd love to see it on the already
>> deployed hubs).
>>
>> What else?
>>
>> [1]
>> http://groups.google.com/group/pubsubhubbub/browse_thread/thread/7f285c23f3e361ac/690e358173c83252
>>
>> --------
>> Ilya Grigorik
>> postrank.com
>
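For what it's worth, here is a rough sketch of what the "special topic URL" idea from Bob's message could look like on the wire. It only builds the form-encoded body of an ordinary PSHB subscription request and shows how a hub might recognise the firehose topic; nothing is sent over the network. The firehose URL is the one floated in the thread, not anything a hub is known to implement, and the function names and the callback URL are made up for illustration.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical well-known topic URL, taken from the proposal in this thread.
# No hub is known to implement it.
FIREHOSE_TOPIC = "http://pubsubhubbub.googlecode.com/WellKnown/firehose"


def build_subscription_body(callback_url, topic_url=FIREHOSE_TOPIC, verify="sync"):
    """Build the form-encoded body of a normal PSHB subscription request.

    The only firehose-specific part is the topic URL; every other field is
    what a subscriber would send for any regular topic.
    """
    params = {
        "hub.mode": "subscribe",
        "hub.callback": callback_url,
        "hub.topic": topic_url,
        "hub.verify": verify,
    }
    return urlencode(params)


def is_firehose_request(body):
    """Hub-side check (illustrative): does this request name the firehose topic?"""
    fields = parse_qs(body)
    return fields.get("hub.topic", [""])[0] == FIREHOSE_TOPIC


if __name__ == "__main__":
    # subscriber.example is a placeholder callback host.
    body = build_subscription_body("http://subscriber.example/callback")
    print(body)
    print(is_firehose_request(body))
```

The point of the sketch is that the subscriber side needs no new machinery: a hub that wants to gate firehose access (for instance after out-of-band contact) can simply refuse requests for which `is_firehose_request` is true.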
