[pubsubhubbub] Re: PSHB Firehose

igrigorik Thu, 22 Oct 2009 06:22:35 -0700

Ok, so let me try rephrasing the problem. The major problem is not the
Hub, or the spec, but the need to crawl thousands of sites to find the
feeds in the first place. Of course, the hub already knows about them,
hence this discussion.


Would it be crazy to then expose a mechanism to enumerate all of the
feeds that a hub tracks, and let the client then subscribe to them?

An obvious obstacle is private feeds, but I wonder if that can be
handled as a special case?
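
As a rough sketch of what the client side could look like (Python,
purely illustrative): the listing format, the `private` flag and the
example URLs are all assumptions, since no hub exposes such an
enumeration today. Only the `hub.*` subscription parameters come from
the PubSubHubbub spec.

```python
from urllib.parse import urlencode


def public_topics(listing):
    """Return topic URLs from a hub listing, skipping entries flagged private.

    The listing shape (dicts with "topic" and an optional "private" key)
    is hypothetical; it stands in for whatever an enumeration endpoint
    might return.
    """
    return [entry["topic"] for entry in listing if not entry.get("private", False)]


def subscription_request(topic, callback):
    """Build the form-encoded body of a PubSubHubbub subscribe request."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic,
        "hub.callback": callback,
        "hub.verify": "async",
    })


# Hypothetical result of asking a hub for every feed it tracks.
listing = [
    {"topic": "http://example.com/a.atom"},
    {"topic": "http://example.com/b.atom", "private": True},
]

# One subscribe request body per public topic; each would be POSTed to the hub.
bodies = [
    subscription_request(topic, "http://client.example.org/callback")
    for topic in public_topics(listing)
]
```

Private feeds are simply left out of the enumeration here; a hub could
equally require authentication before listing them.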

ig

On Oct 21, 1:55 pm, Marcus Herou <[email protected]> wrote:
> Well... We could publish a firehose as well, but currently our
> business model is not aimed at that.
>
> I was not talking about a service but rather a technology which could
> take the crawling business to another level by aggregating hundreds of
> hubs and creating something which can effectively deliver tb/s of
> bandwidth through decentralized servers and data. We are still
> limited, you know, by our infrastructure even though it has gb/s to
> the internet.
>
> Since the realtime web is currently still very small, all of us need
> to poll something, even you I presume, to be able to create a pub/sub
> arch. In that respect our companies are quite similar: you chose to
> aggregate and publish your data and make a business of it. We
> aggregate and refine the data and make a business out of that.
>
> Don't get me wrong, I really like what you do, but I am not looking
> for a data supplier at this time (that might change, though). But if I
> were to look in the data supplier direction, you are currently in
> my/our top ten list :)
>
> Sent from my iPhone
>
> On Oct 21, 5:03 pm, Julien Genestoux <[email protected]>
> wrote:
>
> > Hum... http://superfeedr.com?
>
> > "Putting resources in common" is definitely one of the key reasons why we
> > built superfeedr. More about that here:
> > http://blog.superfeedr.com/gospel/something-stupid/
>
> > And yes, we have a firehose available.
>
> > Julien
>
> > --
> > Julien Genestoux,
>
> > http://twitter.com/julien51 http://superfeedr.com
>
> > +1 (415) 254 7340
> > +33 (0)9 70 44 76 29
>
> > On Wed, Oct 21, 2009 at 5:26 AM, Marcus Herou 
> > <[email protected]> wrote:
>
> > > Feedtree looks cool... but last updated in 2006?
>
> > > On Wed, Oct 21, 2009 at 2:20 PM, Nick Johnson (Google) <
> > > [email protected]> wrote:
>
> > >> On Wed, Oct 21, 2009 at 1:14 PM, Alexis Richardson <
> > >> [email protected]> wrote:
>
> > >>> Hmmm ... gossiptorrent?
>
> > >> Feedtree.
>
> > >>> On Wed, Oct 21, 2009 at 7:23 AM, Marcus Herou
> > >>> <[email protected]> wrote:
>
> > >>> > Hi.
>
> > >>> > We host a search app which is based on feeds of blogs/twitter/forums/
> > >>> > news etc. We are, as you mention, polling everything like crazy,
> > >>> > and it seems like a total waste of everyone's resources.
>
> > >>> > So subscribing to something which would potentially deliver the
> > >>> > material to us would be great, not just for us but also for all
> > >>> > the sites we are crawling.
>
> > >>> > However, who would want to open up a firehose for free for everyone
> > >>> > to consume? It will for sure consume a lot of bandwidth, and a few
> > >>> > subscribers will consume most of it with this model.
> > >>> > I thought of something that might solve this issue. Consider the
> > >>> > following:
>
> > >>> > 1)
> > >>> > * Charge for the bandwidth (wordpress.com does this with a flat fee)
>
> > >>> > 2)
> > >>> > * Everyone that has firehose-consuming needs should also start a
> > >>> > hub to show good faith and morale.
> > >>> > * Add support in firehose-enabled hubs to share state (with a
> > >>> > master?)
> > >>> > * A firehose-enabled hub can subscribe to a master hub, which makes
> > >>> > sure that the subscriber also fulfils some form of contract (i.e.
> > >>> > actually updating/delivering feeds)
> > >>> > * Each firehose-enabled hub must be public, and everyone can
> > >>> > subscribe to the feeds as they can today.
> > >>> > * To share load equally (morale part), subscribers should
> > >>> > subscribe to a load-balanced DNS name or some form of delegate
> > >>> >  lb.pshb.com = master hub
> > >>> >  Example 1: lb.pshb.com resolves to pshb.tailsweep.com or
> > >>> > pshb.google.com, effectively DNS round-robin
> > >>> >  Example 2: lb.pshb.com delegates in some way to any active hub
> > >>> > connected to the master.
>
> > >>> > This might be too complex to implement, and bottlenecks will occur
> > >>> > at the master, but systems like Hadoop have a bottleneck in the
> > >>> > NameNode (master) and seem to perform just fine, so it can be
> > >>> > done. However, each firehose hub probably needs to persist each
> > >>> > feed for a certain amount of time before purging it.
>
> > >>> > Anyway, this was just a thought. We at Tailsweep could probably help
> > >>> > in making this happen if there is some interest.
>
> > >>> > Cheers
>
> > >>> > //Marcus
>
> > >>> > On Oct 20, 8:41 pm, Bob Wyman <[email protected]> wrote:
> > >>> >> On Tue, Oct 20, 2009 at 11:22 AM, igrigorik <[email protected]> wrote:
> > >>> >> > Specifically, if we treat 'firehose' as any bundle of
> > >>> >> > feeds (all, or some), then a hub could define
> > >>> >> > multiple firehose streams.
>
> > >>> >> There should be no question that there is tremendous utility in
> > >>> >> being able to compose all sorts of "bundles" of topics into
> > >>> >> distinct feeds. It is probably also the case that we can identify
> > >>> >> some number of such bundles that would be useful to a large number
> > >>> >> of subscribers. On the other hand, many bundles will be very
> > >>> >> specific and only useful to one or a small number of subscribers.
> > >>> >> In fact, I think what we'll see is that once we have the core PSHB
> > >>> >> defined, we'll then see innovation in the definition of "down
> > >>> >> stream" services whose function is precisely to build and deliver
> > >>> >> such bundles. Some of these services will aggregate groups of
> > >>> >> topics while others will focus instead on creating content-based
> > >>> >> streams -- they will bundle together individual entries based on
> > >>> >> the content of those entries rather than simply combining all
> > >>> >> entries from some set of topics.
>
> > >>> >> I think we should be careful not to force too much of the burden
> > >>> >> of bundling or aggregating into the core PSHB hub specification.
> > >>> >> If we want to address the challenges of building bundles or
> > >>> >> aggregations, I think it best to do so in secondary or companion
> > >>> >> specifications. This will keep the core cleaner and easy to
> > >>> >> understand while also allowing the core to be deployed without
> > >>> >> being delayed by discussions over non-core issues.
>
> > >>> >> Having argued against making the core more complicated by
> > >>> >> extending it to include creating aggregate topics, I still suggest
> > >>> >> that it would be useful to have the core system define a common
> > >>> >> means to obtain a pure "firehose" feed of all topics. The current
> > >>> >> hub spec works for people who only want "none or some" of the
> > >>> >> topics served by the hub. I suggest that we expand this to have
> > >>> >> hubs know how to provide "none, some or all" of the topics.
> > >>> >> The reason for adding support of "all topics" is that we know,
> > >>> >> without much question, that such an "all topics" feed will be
> > >>> >> required by many of the downstream services that we will one day
> > >>> >> be relying on to create more finely defined aggregations. Given
> > >>> >> that this specific feed will be commonly required, it would be
> > >>> >> best if we had a common mechanism for a downstream
> > >>> >> service/subscriber to request that feed and that we set some
> > >>> >> expectations for how that feed will be formatted and delivered
> > >>> >> (i.e. Atom entries, persistent connections, chunked content model,
> > >>> >> ...). It would be very cumbersome for a downstream
> > >>> >> filtering/aggregating service to need to puzzle through service
> > >>> >> specific mechanisms for discovering how to obtain a firehose feed
> > >>> >> of "all topics" from many different hubs.
>
> > >>> >> bob wyman
>
> > >>> >> On Tue, Oct 20, 2009 at 11:22 AM, igrigorik <[email protected]> wrote:
>
> > >>> >> > Right, so how does the smart hub aggregate the feeds? Does it then
> > >>> >> > have to crawl to find the list? That wouldn't be very useful.
> > >>> >> > Having said that...
>
> > >>> >> > +1 For 'smart, aggregating hub generating a synthetic feed'
> > >>> >> > +1 For XRD discovery of the firehose endpoint.
>
> > >>> >> > Thinking a bit more about the firehose, what about making it more
> > >>> >> > flexible. Specifically, if we treat 'firehose' as any bundle of
> > >>> >> > feeds (all, or some), then a hub could define multiple firehose
> > >>> >> > streams. For example, at PostRank we classify feeds by topic, so
> > >>> >> > if someone wanted to subscribe to "Technology", we could expose
> > >>> >> > that as a firehose so the user doesn't have to subscribe to every
> > >>> >> > feed in that topic. In essence, a firehose stream is then any
> > >>> >> > bundle of feeds.
>
> > >>> >> > This may be overloading the hub spec but the overall mechanics
> > >>> >> > would be:
> > >>> >> >  - A (super)user can declare a firehose endpoint
> > >>> >> >  - A (super)user is then able to add or remove subscriptions from
> > >>> >> >    the firehose to create arbitrary aggregation streams
> > >>> >> >  - A subscriber uses XRD to discover the available aggregation
> > >>> >> >    streams
> > >>> >> >  - Firehose with 'all' feeds is a special case of the above, where
> > >>> >> >    all feeds are present
>
> > >>> >> > This definitely adds more complexity into the hub... The
> > >>> >> > alternative is of course for the publisher to create a syndicated
> > >>> >> > feed and publish that directly as a standalone feed. Still trying
> > >>> >> > to weigh the up/downsides in my head, but want to put it out
> > >>> >> > there as an idea.
>
> > >>> >> > --------
> > >>> >> > Ilya Grigorik
> > >>> >> > postrank.com
>
> > >> --
> > >> Nick Johnson, Developer Programs Engineer, App Engine
> > >> Google Ireland Ltd. :: Registered in Dublin, Ireland,
> > >> Registration Number: 368047
>
> > > --
> > > Marcus Herou CTO and co-founder Tailsweep AB
> > > +46702561312
> > > [email protected]
> > > http://www.tailsweep.com/
