I think this whole approach only makes sense if there are lots of
hubs, and even then I am not sure.  If there are only a few hubs, then
the firehose really is a feed of "everything".

If there are more than a few hubs, e.g. 'specialised' hubs (a
WordPress hub, a Blogger hub), people will want to use filters if they
have to pay for a lot of bandwidth.  But why put all this complexity
in the protocol?



On Mon, Oct 19, 2009 at 10:43 PM, Bob Wyman <[email protected]> wrote:
> On Mon, Oct 19, 2009 at 4:23 PM, igrigorik <[email protected]> wrote:
>> the firehose use case
> +1 There are a variety of application patterns that require a firehose.
> Applications that take action based on the content of entries almost
> universally require a firehose or some means to approximate one. For
> instance, many content-based search applications (either retrospective or
> prospective) will wish to see as much as possible of the content that passes
> through a hub. If the hub doesn't provide a firehose, then these
> applications will end up being burdened with the need to scour the web to
> find feeds, figure out which hub is serving them, and then registering
> massive numbers of subscriptions which are intended to construct a rough
> approximation of a firehose. These large subscription lists are likely to be
> a burden not only on the subscribers but also on the hubs.
>
> I think we'll discover that for "large" hubs, subscribers will either be
> interested in a small percentage of the topics served by the hub or they
> will be interested in virtually all of them. For "large" hubs, there won't
> be too many applications that have a need for "middle" percentages of
> topics. (you'll want < XX% of the topics or you'll want 100% of them...) So,
> while PSHB today provides for the "small number of subscriptions" use case,
> we really do need to address the other kind of subscription -- the
> subscription to everything.
>
>>  - Spam. Just as with the ping servers, how do
>> we protect the hub from becoming a spam
>> distributor? Thoughts?
> I think we really should treat "spam" and "firehose" as orthogonal issues.
> Certainly, spam is a problem that will be exacerbated by the provision of a
> firehose; however, whether or not the hub does something with spam should
> not determine whether or not the hub supports firehose subscriptions.
>
> In general, I think we should assume that spam control is up to the
> subscriber. For non-firehose subscriptions, subscribers are free to
> subscribe to spam if they wish. It is, of course, expected that most
> subscribers will drop subscriptions to any topic that turns out to be
> spammy. This is great since it allows the hub to ignore the issue of spam
> and rely on subscribers to filter things. It is probably best, and within
> the spirit of keeping PSHB hubs as simple as possible, to keep the hubs free
> of spam detection. Those who create firehose
> subscriptions should be prepared to receive large volumes of data and should
> also be prepared to do the necessary spam identification and filtering.
>
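
To make the "spam control is up to the subscriber" idea concrete, here is a
minimal sketch of filtering on the receiving side. The Entry type, its
fields, and the example.org URLs are made up for illustration; nothing here
is defined by PSHB, and a real subscriber would likely use something
smarter than a fixed blocklist.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Set


@dataclass(frozen=True)
class Entry:
    """One delivered entry; the fields are illustrative, not part of PSHB."""
    feed_url: str
    title: str


def drop_blocked(entries: Iterable[Entry], blocked_feeds: Set[str]) -> Iterator[Entry]:
    """Yield only entries whose source feed is not on the subscriber's blocklist."""
    for entry in entries:
        if entry.feed_url not in blocked_feeds:
            yield entry


# Hypothetical firehose delivery containing one good and one spammy feed.
incoming = [
    Entry("http://example.org/good.atom", "A real post"),
    Entry("http://example.org/spam.atom", "Buy now"),
]
kept = list(drop_blocked(incoming, {"http://example.org/spam.atom"}))
```

The hub stays unaware of the blocklist; it lives entirely with the
subscriber, which is the point of the paragraph above.
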
> There is one option that would allow the hub and firehose subscribers to
> share the burden of handling spam. (Note: I *do not* think we should pursue
> this option.) We could allow firehose subscribers to create filters that
> would be lists of feeds that they didn't want to receive. Thus, the
> "firehose" would really be the "firehose minus the exceptions." While this
> option has some charm to it, I think the reality is that it is probably more
> complexity than is useful. It also puts quite a burden on the hub to
> maintain the list of exceptions. That would not be a good thing.
>
>>  - Would have to define the subscription mechanics / protocol.
> It seems to me that the simplest thing to do here is to reuse all of the
> existing subscription framework and just define a "special" topic URL that
> indicates a firehose subscription. Something like the following might work:
>      http://pubsubhubbub.googlecode.com/firehose
>
> If we wanted to, we could decide that there will be more than one member of
> the class of well-known "special URLs" and thus do something like define a
> common base URL which is then extended. We might, say, use "WellKnown" to
> indicate this class of special URLs. Thus, a firehose URL would be:
>     http://pubsubhubbub.googlecode.com/WellKnown/firehose
> Using such a base URL makes it obvious how to extend the system. For
> instance, we might one day want a hub to be able to publish an XRD file that
> contains interesting bits of information about the hub and we might want
> some way to allow people to be notified of when that XRD file is updated. In
> that case, we might say that the way you get these updates, in a general
> way, is to subscribe to:
>     http://pubsubhubbub.googlecode.com/WellKnown/XRD
>
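
If the existing subscription framework is reused as suggested, a firehose
subscription would just be an ordinary subscribe request whose topic is the
special URL. A rough sketch of building the form-encoded request body
follows. The hub.* parameter names are as I understand the current draft
and should be checked against the spec; the callback URL is hypothetical,
and the firehose topic URL is only the one proposed above, not something
any hub is known to support. The code only builds the body; it does not
send anything.

```python
from urllib.parse import urlencode

# Topic URL as proposed in this thread; not a deployed endpoint.
FIREHOSE_TOPIC = "http://pubsubhubbub.googlecode.com/WellKnown/firehose"


def build_subscribe_body(callback_url: str, topic_url: str = FIREHOSE_TOPIC) -> str:
    """Return the form-encoded body of a subscribe request for topic_url."""
    params = [
        ("hub.mode", "subscribe"),
        ("hub.topic", topic_url),
        ("hub.callback", callback_url),  # where the hub would deliver content
        ("hub.verify", "sync"),          # assumption: synchronous verification
    ]
    return urlencode(params)


# Hypothetical subscriber callback.
body = build_subscribe_body("http://subscriber.example.com/callback")
```

The subscriber would POST this body to the hub as
application/x-www-form-urlencoded, exactly as for a normal topic. The hub
remains free to refuse it.
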
> Note: Because firehose subscriptions are often very expensive to maintain,
> subscribers shouldn't be surprised if their subscriptions aren't
> automatically accepted. It should be perfectly reasonable for a hub to
> require that some out-of-band communication (such as email messages, phone
> calls, etc.) be necessary before a firehose subscription is accepted. (We
> should, however, avoid the kind of complexity that comes from doing things
> like supporting deferred subscription acceptance such as is done with XMPP
> PubSub...)
>
> The really interesting question becomes: "What is the format of a firehose
> feed?" We could just say that what you'll get is the same as if you'd made
> many, many subscriptions. However, given that a firehose will be resource
> intensive in most cases, I think it makes sense to stipulate that
> aggregation will normally be used over potentially persistent HTTP
> connections that do chunked content delivery. We should also consider adding
> the <time/> and <SorryTooSlow/> elements that Brad found useful for the
> SixApart Update Stream. (See:
> http://www.sixapart.com/labs/update/developers/ )
>
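
For readers unfamiliar with chunked delivery, the sketch below decodes a
body framed with HTTP/1.1 chunked transfer coding: each chunk is a
hexadecimal length, CRLF, that many bytes, CRLF, and a zero-length chunk
ends the body. It only illustrates the framing. It works on a complete
in-memory buffer, ignores trailers, and does not model the <time/> or
<SorryTooSlow/> elements; a real firehose client would read incrementally
from the socket, or let its HTTP library do this.

```python
def decode_chunked(raw: bytes) -> bytes:
    """Decode a complete HTTP/1.1 chunked body held in memory."""
    body = bytearray()
    pos = 0
    while True:
        # The chunk-size line ends at the next CRLF.
        line_end = raw.index(b"\r\n", pos)
        # Drop any chunk extensions after ';' and parse the hex size.
        size = int(raw[pos:line_end].split(b";", 1)[0].strip(), 16)
        pos = line_end + 2
        if size == 0:
            break  # last-chunk marker; trailers, if any, are ignored
        body += raw[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(body)


# Two chunks ("hello" and " world") followed by the terminating chunk.
sample = b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
decoded = decode_chunked(sample)
```
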
> bob wyman
>
> On Mon, Oct 19, 2009 at 4:23 PM, igrigorik <[email protected]> wrote:
>>
>> This was discussed briefly before [1], but would like to bring up the
>> firehose use case once again within the context of PSHB.
>>
>> Use case: it would be great if the spec allowed subscribing to _all_
>> new content seen by the hub. For example, blogger supports PSHB, but
>> short of crawling for all of the blogger feeds, there is no way to get
>> the full content stream.
>>
>> There are a few things that would have to be resolved:
>>  - Spam. Just as with the ping servers, how do we protect the hub from
>> becoming a spam distributor? Thoughts?
>>  - Would have to define the subscription mechanics / protocol. Even if
>> this is an optional extension to the protocol that select hubs could
>> implement. (Though, of course, I'd love to see it on the already
>> deployed hubs).
>>
>> What else?
>>
>> [1]
>> http://groups.google.com/group/pubsubhubbub/browse_thread/thread/7f285c23f3e361ac/690e358173c83252
>>
>> --------
>> Ilya Grigorik
>> postrank.com
>
