On Thu, Oct 22, 2009 at 9:22 AM, igrigorik <[email protected]> wrote:
> Would it be crazy to then expose a mechanism to
> enumerate all of the feeds that a hub tracks, and let
> the client then subscribe to them?
I think that if I was a spammer, I would strongly support a mechanism to enumerate feeds. That would allow me to easily scan to determine which of my feeds had been removed as spam and which were still useful. Certainly, I could "test" my feeds by actually publishing data; however, I would prefer not to do that, since my testing might, in fact, trigger some spam detection code. Thus, I would like to have a mechanism to check my spam feeds that didn't involve actually publishing through them.

Using this mechanism, I might do things like slowly build up a set of inactive but "non-spam" feeds whose "reputation" would probably grow as they lived spam-free for a longer time. I might then burst a whole bunch of spam through some subset of my "well-aged" feeds from time to time.

Of course, a hub owner could make my life as a spammer just a little bit more difficult by including all known spam feeds in the enumerated list. But I'm thinking that hub managers wouldn't do that, since non-spammers would complain too much. One problem is that the list would grow indefinitely; the other is that people would keep asking the hub to stop pestering them by listing known spam feeds. In essence, my job as a spammer would become easier as the hub owners tried to make life easier for the non-spammers... Help yourself and you help me.

bob wyman (not a spammer -- just imagining...)

On Thu, Oct 22, 2009 at 9:22 AM, igrigorik <[email protected]> wrote:
>
> Ok, so let me try rephrasing the problem. The major problem is not the
> Hub, or the spec, but the need to crawl thousands of sites to find the
> feeds in the first place. Of course, the hub already knows about them,
> hence this discussion.
>
> Would it be crazy to then expose a mechanism to enumerate all of the
> feeds that a hub tracks, and let the client then subscribe to them?
>
> An obvious obstacle is private feeds, but I wonder if that can be
> handled as a special case?
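[Editor's note] Bob's probing attack can be made concrete with a short sketch. This is my own illustration, not anything from the thread or the PSHB spec: given an enumeration mechanism, a spammer simply diffs the feeds he registered against the hub's published list to learn which ones were removed as spam, without ever publishing a test entry that might trip spam detection.

```python
# Hypothetical sketch of the probing attack Bob describes: a spammer
# diffs his registered feeds against the hub's enumerated list to see
# which were silently removed as spam. Feed URLs are made up.
def removed_as_spam(registered_feeds, enumerated_feeds):
    """Return registered feeds that no longer appear in the hub's list."""
    return sorted(set(registered_feeds) - set(enumerated_feeds))

# The spammer registered three feeds; the hub now only lists two of them,
# so the diff reveals which one the spam filter caught.
flagged = removed_as_spam(
    ["http://a.example/feed", "http://b.example/feed", "http://c.example/feed"],
    ["http://a.example/feed", "http://c.example/feed"],
)
```

No publish ever happens, which is exactly why Bob argues enumeration hands spammers a free feedback channel.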
>
> ig
>
> On Oct 21, 1:55 pm, Marcus Herou <[email protected]> wrote:
> > Well... We can as well publish a firehose, but currently our business
> > model is not aimed at that.
> >
> > I was not talking about a service, rather a technology which could take
> > the crawling business to another level by aggregating hundreds of hubs
> > and creating something which effectively can deliver tb/s bandwidth by
> > having decentralized servers and data. We are still limited, you know,
> > by our infrastructure even though it has gb/s to the internet.
> >
> > Since the realtime web is currently still very small, all of us need to
> > poll something -- even you, I presume -- to be able to create a pub/sub
> > arch. In that remark our companies are quite similar: you chose to
> > aggregate and publish your data and make a business of it. We aggregate
> > and refine the data and make business out of that.
> >
> > Don't take me wrong, I really like what you do, but I am not looking for
> > a data supplier at this time (might change though). But if I would look
> > in the data supplier direction, you are currently in my/our top ten
> > list :)
> >
> > Sent from my iPhone
> >
> > On Oct 21, 5:03 pm, Julien Genestoux <[email protected]> wrote:
> >
> > > Hum... http://superfeedr.com ?
> > >
> > > "Putting resources in common" is definitely one of the key reasons why
> > > we built superfeedr. More about that here:
> > > http://blog.superfeedr.com/gospel/something-stupid/
> > >
> > > And yes, we have a firehose available.
> > >
> > > Julien
> > >
> > > --
> > > Julien Genestoux,
> > >
> > > http://twitter.com/julien51
> > > http://superfeedr.com
> > >
> > > +1 (415) 254 7340
> > > +33 (0)9 70 44 76 29
> > >
> > > On Wed, Oct 21, 2009 at 5:26 AM, Marcus Herou <[email protected]> wrote:
> > >
> > > > Feedtree looks cool... but updated 2006?
> > > >
> > > > On Wed, Oct 21, 2009 at 2:20 PM, Nick Johnson (Google) <
> > > > [email protected]> wrote:
> > > >
> > > >> On Wed, Oct 21, 2009 at 1:14 PM, Alexis Richardson <
> > > >> [email protected]> wrote:
> > > >
> > > >>> Hmmm ... gossiptorrent?
> > > >
> > > >> Feedtree.
> > > >
> > > >>> On Wed, Oct 21, 2009 at 7:23 AM, Marcus Herou
> > > >>> <[email protected]> wrote:
> > > >
> > > >>> > Hi.
> > > >
> > > >>> > We host a search app which is based on feeds of blogs/twitter/
> > > >>> > forums/news etc. We are, as you mention, polling everything like
> > > >>> > crazy, and it seems like a total waste of everyone's resources.
> > > >
> > > >>> > So subscribing to something which would deliver the material to us
> > > >>> > would be great, not just for us but for all the sites we are
> > > >>> > crawling as well.
> > > >
> > > >>> > However, who would like to open up a firehose for free for everyone
> > > >>> > to consume? It will for sure consume a lot of bandwidth, and a few
> > > >>> > subscribers will consume most of the bandwidth with this model.
> > > >>> > I thought of something that might solve this issue. Consider the
> > > >>> > following:
> > > >
> > > >>> > 1)
> > > >>> > * Charge for the bandwidth (wordpress.com does this with a flat fee)
> > > >
> > > >>> > 2)
> > > >>> > * Everyone that has firehose-consuming needs should also start a
> > > >>> > hub to show good faith and morale.
> > > >>> > * Add support in firehose-enabled hubs to share state (with a
> > > >>> > master?)
> > > >>> > * A firehose-enabled hub can subscribe to a master hub which makes
> > > >>> > sure that the subscriber also fulfils some form of contract (i.e.
> > > >>> > actually updating/delivering feeds)
> > > >>> > * Each firehose-enabled hub must be public, and everyone can
> > > >>> > subscribe to the feeds as is currently the case.
> > > >>> > * To share load equally (the morale part), subscribers should
> > > >>> > subscribe to a load-balanced DNS name or some form of delegate.
> > > >>> > lb.pshb.com = master hub
> > > >>> > Example 1: lb.pshb.com resolves to pshb.tailsweep.com and
> > > >>> > pshb.google.com, effectively DNS round-robin.
> > > >>> > Example 2: lb.pshb.com delegates to any active master-connected hub
> > > >>> > in some way.
> > > >
> > > >>> > This might be too complex to implement, and bottlenecks occur at
> > > >>> > the master, but systems like Hadoop have bottlenecks in terms of
> > > >>> > the NameNode (master) and seem to perform just fine, so it can be
> > > >>> > done. However, each firehose hub would probably need to persist
> > > >>> > each feed for a certain amount of time before purging it.
> > > >
> > > >>> > Anyway, this was just a thought. We at Tailsweep could probably
> > > >>> > help in making this happen if there is some interest.
> > > >
> > > >>> > Cheers
> > > >
> > > >>> > //Marcus
> > > >
> > > >>> > On Oct 20, 8:41 pm, Bob Wyman <[email protected]> wrote:
> > > >>> >> On Tue, Oct 20, 2009 at 11:22 AM, igrigorik <[email protected]> wrote:
> > > >>> >> > Specifically, if we treat 'firehose' as any bundle of
> > > >>> >> > feeds (all, or some), then a hub could define
> > > >>> >> > multiple firehose streams.
> > > >
> > > >>> >> There should be no question that there is tremendous utility in
> > > >>> >> being able to compose all sorts of "bundles" of topics into
> > > >>> >> distinct feeds. It is probably also the case that we can identify
> > > >>> >> some number of such bundles that would be useful to a large number
> > > >>> >> of subscribers. On the other hand, many bundles will be very
> > > >>> >> specific and only useful to one or a small number of subscribers.
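[Editor's note] Marcus's "Example 1" is ordinary DNS round-robin. A minimal sketch of the idea, with a local rotation standing in for the DNS resolver (the hub names come from his example; the `lb.pshb.com` pool itself is hypothetical):

```python
import itertools

# Sketch of lb.pshb.com resolving round-robin over participating
# firehose hubs, so subscribers are spread evenly across the pool.
# A cycled local list stands in for actual DNS round-robin records.
HUBS = ["pshb.tailsweep.com", "pshb.google.com"]
_rotation = itertools.cycle(HUBS)

def next_hub():
    """Return the next hub in round-robin order."""
    return next(_rotation)
```

Real DNS round-robin would hand out the A records in rotating order per query; the effect for subscribers is the same as cycling this list.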
> > > >>> >> In fact, I think what we'll see is that once we have the core
> > > >>> >> PSHB defined, we'll then see innovation in the definition of
> > > >>> >> "downstream" services whose function is precisely to build and
> > > >>> >> deliver such bundles. Some of these services will aggregate
> > > >>> >> groups of topics while others will focus instead on creating
> > > >>> >> content-based streams -- they will bundle together individual
> > > >>> >> entries based on the content of those entries rather than simply
> > > >>> >> combining all entries from some set of topics.
> > > >
> > > >>> >> I think we should be careful not to force too much of the burden
> > > >>> >> of bundling or aggregating into the core PSHB hub specification.
> > > >>> >> If we want to address the challenges of building bundles or
> > > >>> >> aggregations, I think it best to do so in secondary or companion
> > > >>> >> specifications. This will keep the core cleaner and easier to
> > > >>> >> understand while also allowing the core to be deployed without
> > > >>> >> being delayed by discussions over non-core issues.
> > > >
> > > >>> >> Having argued against making the core more complicated by
> > > >>> >> extending it to include creating aggregate topics, I still
> > > >>> >> suggest that it would be useful to have the core system define a
> > > >>> >> common means to obtain a pure "firehose" feed of all topics. The
> > > >>> >> current hub spec works for people who only want "none or some" of
> > > >>> >> the topics served by the hub. I suggest that we expand this to
> > > >>> >> have hubs know how to provide "none, some or all" of the topics.
> > > >>> >> The reason for adding support of "all topics" is that we know,
> > > >>> >> without much question, that such an "all topics" feed will be
> > > >>> >> required by many of the downstream services that we will one day
> > > >>> >> be relying on to create more finely defined aggregations. Given
> > > >>> >> that this specific feed will be commonly required, it would be
> > > >>> >> best if we had a common mechanism for a downstream
> > > >>> >> service/subscriber to request that feed, and that we set some
> > > >>> >> expectations for how that feed will be formatted and delivered
> > > >>> >> (i.e. Atom entries, persistent connections, chunked content
> > > >>> >> model, ...). It would be very cumbersome for a downstream
> > > >>> >> filtering/aggregating service to need to puzzle through
> > > >>> >> service-specific mechanisms for discovering how to obtain a
> > > >>> >> firehose feed of "all topics" from many different hubs.
> > > >
> > > >>> >> bob wyman
> > > >
> > > >>> >> On Tue, Oct 20, 2009 at 11:22 AM, igrigorik <[email protected]> wrote:
> > > >
> > > >>> >> > Right, so how does the smart hub aggregate the feeds? Does it
> > > >>> >> > then have to crawl to find the list? That wouldn't be very
> > > >>> >> > useful. Having said that...
> > > >
> > > >>> >> > +1 For 'smart, aggregating hub generating a synthetic feed'
> > > >>> >> > +1 For XRD discovery of the firehose endpoint.
> > > >
> > > >>> >> > Thinking a bit more about the firehose, what about making it
> > > >>> >> > more flexible? Specifically, if we treat 'firehose' as any
> > > >>> >> > bundle of feeds (all, or some), then a hub could define
> > > >>> >> > multiple firehose streams.
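[Editor's note] The "common mechanism" Bob asks for amounts to one well-known link relation that every hub advertises in its discovery document, so a subscriber never has to puzzle through per-service conventions. A hedged sketch of that subscriber-side lookup; the rel value is purely illustrative, since no spec in this thread defines one:

```python
# Sketch of common firehose discovery: scan (rel, href) pairs parsed
# from a hub's discovery document (e.g. XRD) for one agreed-upon link
# relation. "firehose" is a hypothetical rel name, not from any spec.
FIREHOSE_REL = "firehose"

def find_firehose(links):
    """Return the firehose href from (rel, href) pairs, or None."""
    for rel, href in links:
        if rel == FIREHOSE_REL:
            return href
    return None
```

With such a convention, the same three lines of client code would work against every hub, which is the whole point of Bob's proposal.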
> > > >>> >> > For example, at PostRank we classify feeds by topic, so if
> > > >>> >> > someone wanted to subscribe to "Technology", we could expose
> > > >>> >> > that as a firehose so the user doesn't have to subscribe to
> > > >>> >> > every feed in that topic. In essence, a firehose stream is then
> > > >>> >> > any bundle of feeds.
> > > >
> > > >>> >> > This may be overloading the hub spec, but the overall mechanics
> > > >>> >> > would be:
> > > >>> >> > - A (super)user can declare a firehose endpoint
> > > >>> >> > - A (super)user is then able to add or remove subscriptions
> > > >>> >> > from the firehose to create arbitrary aggregation streams
> > > >>> >> > - A subscriber uses XRD to discover the available aggregation
> > > >>> >> > streams
> > > >>> >> > - Firehose with 'all' feeds is a special case of the above,
> > > >>> >> > where all feeds are present
> > > >
> > > >>> >> > This definitely adds more complexity into the hub... The
> > > >>> >> > alternative is of course for the publisher to create a
> > > >>> >> > syndicated feed and publish that directly as a standalone feed.
> > > >>> >> > Still trying to weigh the up/downsides in my head, but want to
> > > >>> >> > put it out there as an idea.
> > > >
> > > >>> >> > --------
> > > >>> >> > Ilya Grigorik
> > > >>> >> > postrank.com
> > > >
> > > >> --
> > > >> Nick Johnson, Developer Programs Engineer, App Engine
> > > >> Google Ireland Ltd. :: Registered in Dublin, Ireland, Registration
> > > >> Number: 368047
> > > >
> > > > --
> > > > Marcus Herou CTO and co-founder Tailsweep AB
> > > > +46702561312
> > > > [email protected]
> > > > http://www.tailsweep.com/
>
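[Editor's note] Ilya's bundle mechanics reduce to a small amount of hub-side state. A toy model, with all names my own invention rather than anything from a spec: the hub tracks feeds, a (super)user declares named aggregation streams and edits their membership, and 'all' is the special-case firehose containing every tracked feed.

```python
# Toy model of Ilya's proposal: named aggregation streams over the
# hub's tracked feeds, with 'all' as the special-case full firehose.
# Class and method names are illustrative, not from the PSHB spec.
class Hub:
    def __init__(self):
        self.feeds = set()    # every topic the hub tracks
        self.bundles = {}     # stream name -> set of feed URLs

    def track(self, feed):
        self.feeds.add(feed)

    def declare_bundle(self, name, feeds=()):
        # "A (super)user can declare a firehose endpoint"
        self.bundles[name] = set(feeds)

    def add_to_bundle(self, name, feed):
        self.bundles[name].add(feed)

    def remove_from_bundle(self, name, feed):
        self.bundles[name].discard(feed)

    def stream(self, name):
        # 'all' is the classic firehose: a bundle of every feed.
        if name == "all":
            return set(self.feeds)
        return set(self.bundles.get(name, ()))
```

A subscriber would then discover the available stream names (e.g. via XRD, per the thread) and subscribe to "Technology" or "all" as a single topic.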
