[pubsubhubbub] Re: PSHB Firehose

Nick Johnson (Google) Thu, 22 Oct 2009 06:41:11 -0700

On Thu, Oct 22, 2009 at 2:22 PM, igrigorik <[email protected]> wrote:

>
> Ok, so let me try rephrasing the problem. The major problem is not the
> Hub, or the spec, but the need to crawl thousands of sites to find the
> feeds in the first place. Of course, the hub already knows about them,
> hence this discussion.
>
> Would it be crazy to then expose a mechanism to enumerate all of the
> feeds that a hub tracks, and let the client then subscribe to them?
>


The other obstacle would be that the hub only 'knows' about feeds that have
at least one subscriber, and doesn't know when a feed goes away.

-Nick
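
[Illustrative sketch, not hub code] To make the point above concrete: a minimal in-memory model, assuming a hub keeps only a topic -> subscriber-callbacks table. The class name HubRegistry, its methods, and the example URLs are all hypothetical and not part of the PSHB spec or any real hub. It shows why an "enumerate all feeds" call could only return topics that currently have at least one subscriber.

```python
from collections import defaultdict


class HubRegistry:
    """Toy model of a hub's subscription table (hypothetical, not a real hub API)."""

    def __init__(self):
        # topic URL -> set of subscriber callback URLs
        self._subscribers = defaultdict(set)

    def subscribe(self, topic, callback):
        self._subscribers[topic].add(callback)

    def unsubscribe(self, topic, callback):
        callbacks = self._subscribers.get(topic)
        if callbacks is None:
            return
        callbacks.discard(callback)
        if not callbacks:
            # Assumption: with no subscribers left, the hub forgets the topic.
            del self._subscribers[topic]

    def topics(self):
        """Enumerate the feeds the hub currently knows about."""
        return sorted(self._subscribers)


hub = HubRegistry()
hub.subscribe("http://example.com/a.atom", "http://client.example/cb")
hub.subscribe("http://example.com/b.atom", "http://client.example/cb")
hub.unsubscribe("http://example.com/b.atom", "http://client.example/cb")
print(hub.topics())
```

In this sketch the enumeration lists only a.atom once b.atom loses its last subscriber, and nothing in the table records whether a.atom itself still exists at its publisher.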


>
> An obvious obstacle is private feeds, but I wonder if that can be
> handled as a special case?
>
> ig
>
> On Oct 21, 1:55 pm, Marcus Herou <[email protected]> wrote:
> > Well... We could also publish a firehose, but currently our business
> > model is not aimed at that.
> >
> > I was not talking about a service but rather a technology which could take
> > the crawling business to another level by aggregating hundreds of hubs
> > and creating something which effectively can deliver tb/s bandwidth by
> > having decentralized servers and data. We are still limited, you know,
> > by our infrastructure even though it has gb/s to the internet.
> >
> > Since the realtime web is currently still very small, all of us need to
> > poll something (even you, I presume) to be able to create a pub/sub arch.
> > In that respect our companies are quite similar: you chose to aggregate
> > and publish your data and make a business of it. We aggregate and
> > refine the data and make a business out of that.
> >
> > Don't get me wrong, I really like what you do, but I am not looking for
> > a data supplier at this time (that might change, though). But if I were
> > to look in the data supplier direction, you are currently in my/our top
> > ten list :)
> >
> > Sent from my iPhone
> >
> > On Oct 21, 5:03 pm, Julien Genestoux <[email protected]>
> > wrote:
> >
> > > Hum... http://superfeedr.com ?
> >
> > > "Putting resources in common" is definitely one of the key reasons why
> > > we built superfeedr. More about that here:
> > > http://blog.superfeedr.com/gospel/something-stupid/
> >
> > > And yes, we have a firehose available.
> >
> > > Julien
> >
> > > --
> > > Julien Genestoux,
> >
> > > http://twitter.com/julien51 http://superfeedr.com
> >
> > > +1 (415) 254 7340
> > > +33 (0)9 70 44 76 29
> >
> > > On Wed, Oct 21, 2009 at 5:26 AM, Marcus Herou <[email protected]> wrote:
> >
> > > > Feedtree looks cool... but last updated in 2006?
> >
> > > > On Wed, Oct 21, 2009 at 2:20 PM, Nick Johnson (Google) <
> > > > [email protected]> wrote:
> >
> > > >> On Wed, Oct 21, 2009 at 1:14 PM, Alexis Richardson <
> > > >> [email protected]> wrote:
> >
> > > >>> Hmmm ... gossiptorrent?
> >
> > > >> Feedtree.
> >
> > > >>> On Wed, Oct 21, 2009 at 7:23 AM, Marcus Herou
> > > >>> <[email protected]> wrote:
> >
> > > >>> > Hi.
> >
> > > >>> > We host a search app which is based on feeds of blogs/twitter/forums/
> > > >>> > news etc. We are, as you mention, polling everything like crazy
> > > >>> > and it seems like a total waste of everyone's resources.
> >
> > > >>> > So this means that subscribing to something which would potentially
> > > >>> > deliver the material to us would be great not just for us but also
> > > >>> > for all the sites we are crawling.
> >
> > > >>> > However, who would want to open up a firehose for free for everyone to
> > > >>> > consume? It will for sure consume a lot of bandwidth, and a few
> > > >>> > subscribers will consume most of the bandwidth with this model.
> > > >>> > I thought of something that might solve this issue. Consider the
> > > >>> > following:
> >
> > > >>> > 1)
> > > >>> > * Charge for the bandwidth (wordpress.com does this with a flat fee)
> >
> > > >>> > 2)
> > > >>> > * Everyone who has firehose-consuming needs should also start a
> > > >>> > hub to show good faith and morale.
> > > >>> > * Add support in firehose-enabled hubs to share state (with a
> > > >>> > master?)
> > > >>> > * A firehose-enabled hub can subscribe to a master hub, which makes
> > > >>> > sure that the subscriber also fulfils some form of contract (i.e.
> > > >>> > actually updating/delivering feeds)
> > > >>> > * Each firehose-enabled hub must be public, and everyone can subscribe
> > > >>> > to the feeds as they can today.
> > > >>> > * To share load equally (the morale part), subscribers should
> > > >>> > subscribe to a load-balanced DNS name or some form of delegate:
> > > >>> >  lb.pshb.com = master hub
> > > >>> >  Example 1: lb.pshb.com resolves to pshb.tailsweep.com and
> > > >>> > pshb.google.com, effectively DNS round-robin
> > > >>> >  Example 2: lb.pshb.com delegates in some way to any active,
> > > >>> > master-connected hub.
> >
> > > >>> > This might be too complex to implement, and bottlenecks occur at the
> > > >>> > master, but systems like Hadoop have a bottleneck in the
> > > >>> > NameNode (master) and it seems to perform just fine, so it can be
> > > >>> > done. However, each firehose hub probably needs to persist each feed
> > > >>> > for a certain amount of time before purging it.
> >
> > > >>> > Anyway, this was just a thought. We at Tailsweep could probably help in
> > > >>> > making this happen if there is some interest.
> >
> > > >>> > Cheers
> >
> > > >>> > //Marcus
> >
> > > >>> > On Oct 20, 8:41 pm, Bob Wyman <[email protected]> wrote:
> > > >>> >> On Tue, Oct 20, 2009 at 11:22 AM, igrigorik <[email protected]> wrote:
> > > >>> >> > Specifically, if we treat 'firehose' as any bundle of
> > > >>> >> > feeds (all, or some), then a hub could define
> > > >>> >> > multiple firehose streams.
> >
> > > >>> >> There should be no question that there is tremendous utility in being
> > > >>> >> able to compose all sorts of "bundles" of topics into distinct feeds.
> > > >>> >> It is probably also the case that we can identify some number of such
> > > >>> >> bundles that would be useful to a large number of subscribers. On the
> > > >>> >> other hand, many bundles will be very specific and only useful to one
> > > >>> >> or a small number of subscribers. In fact, I think what we'll see is
> > > >>> >> that once we have the core PSHB defined, we'll then see innovation in
> > > >>> >> the definition of "downstream" services whose function is precisely to
> > > >>> >> build and deliver such bundles. Some of these services will aggregate
> > > >>> >> groups of topics while others will focus instead on creating
> > > >>> >> content-based streams -- they will bundle together individual entries
> > > >>> >> based on the content of those entries rather than simply combining all
> > > >>> >> entries from some set of topics.
> >
> > > >>> >> I think we should be careful not to force too much of the burden of
> > > >>> >> bundling or aggregating into the core PSHB hub specification. If we
> > > >>> >> want to address the challenges of building bundles or aggregations, I
> > > >>> >> think it best to do so in secondary or companion specifications. This
> > > >>> >> will keep the core cleaner and easy to understand while also allowing
> > > >>> >> the core to be deployed without being delayed by discussions over
> > > >>> >> non-core issues.
> >
> > > >>> >> Having argued against making the core more complicated by extending it
> > > >>> >> to include creating aggregate topics, I still suggest that it would be
> > > >>> >> useful to have the core system define a common means to obtain a pure
> > > >>> >> "firehose" feed of all topics. The current hub spec works for people
> > > >>> >> who only want "none or some" of the topics served by the hub. I
> > > >>> >> suggest that we expand this to have hubs know how to provide "none,
> > > >>> >> some or all" of the topics.
> > > >>> >> The reason for adding support of "all topics" is that we know, without
> > > >>> >> much question, that such an "all topics" feed will be required by many
> > > >>> >> of the downstream services that we will one day be relying on to
> > > >>> >> create more finely defined aggregations. Given that this specific feed
> > > >>> >> will be commonly required, it would be best if we had a common
> > > >>> >> mechanism for a downstream service/subscriber to request that feed and
> > > >>> >> that we set some expectations for how that feed will be formatted and
> > > >>> >> delivered (i.e. Atom entries, persistent connections, chunked content
> > > >>> >> model, ...). It would be very cumbersome for a downstream
> > > >>> >> filtering/aggregating service to need to puzzle through
> > > >>> >> service specific mechanisms for discovering how to obtain a firehose
> > > >>> >> feed of "all topics" from many different hubs.
> >
> > > >>> >> bob wyman
> >
> > > >>> >> On Tue, Oct 20, 2009 at 11:22 AM, igrigorik <[email protected]> wrote:
> >
> > > >>> >> > Right, so how does the smart hub aggregate the feeds? Does it then
> > > >>> >> > have to crawl to find the list? That wouldn't be very useful. Having
> > > >>> >> > said that...
> >
> > > >>> >> > +1 For 'smart, aggregating hub generating a synthetic feed'
> > > >>> >> > +1 For XRD discovery of the firehose endpoint.
> >
> > > >>> >> > Thinking a bit more about the firehose, what about making it more
> > > >>> >> > flexible. Specifically, if we treat 'firehose' as any bundle of feeds
> > > >>> >> > (all, or some), then a hub could define multiple firehose streams. For
> > > >>> >> > example, at PostRank we classify feeds by topic, so if someone wanted
> > > >>> >> > to subscribe to "Technology", we could expose that as a firehose so
> > > >>> >> > the user doesn't have to subscribe to every feed in that topic. In
> > > >>> >> > essence, a firehose stream is then any bundle of feeds.
> >
> > > >>> >> > This may be overloading the hub spec but the overall mechanics would
> > > >>> >> > be:
> > > >>> >> >  - A (super)user can declare a firehose endpoint
> > > >>> >> >  - A (super)user is then able to add or remove subscriptions from the
> > > >>> >> > firehose to create arbitrary aggregation streams
> > > >>> >> >  - A subscriber uses XRD to discover the available aggregation streams
> > > >>> >> >  - Firehose with 'all' feeds is a special case of the above, where all
> > > >>> >> > feeds are present
> >
> > > >>> >> > This definitely adds more complexity into the hub... The alternative
> > > >>> >> > is of course for the publisher to create a syndicated feed and publish
> > > >>> >> > that directly as a standalone feed. Still trying to weigh the
> > > >>> >> > up/downsides in my head, but want to put it out there as an idea.
> >
> > > >>> >> > --------
> > > >>> >> > Ilya Grigorik
> > > >>> >> > postrank.com
> >
> > > >> --
> > > >> Nick Johnson, Developer Programs Engineer, App Engine
> > > > >> Google Ireland Ltd. :: Registered in Dublin, Ireland, Registration Number:
> > > > >> 368047
> >
> > > > --
> > > > Marcus Herou CTO and co-founder Tailsweep AB
> > > > +46702561312
> > > > [email protected]
> > > > > http://www.tailsweep.com/
>



-- 
Nick Johnson, Developer Programs Engineer, App Engine
Google Ireland Ltd. :: Registered in Dublin, Ireland, Registration Number:
368047
