Hi Julien,

Thanks for your response; you raise some excellent points. Comments inline.
On Jul 15, 2:21 pm, Julien Genestoux <[email protected]> wrote:
> Dan,
> Sorry for the late reply. First, I'd like to thank you and the FeedBurner
> team for the efforts.
> A few comments:
>
> - *When a hub crawls a proxied feed, the proxy should treat it as a ping.*
> I need to think about this a bit. Superfeedr acts as a "default" hub, which
> means that we work for any feed, by polling on behalf of the subscriber(s).
> This may cause issues, as we guarantee our subscribers that we will poll the
> feed at least once a day, to tell them about the status of the feed. I
> understand this goes a little bit beyond the regular PubSubHubbub approach,
> but we may need to implement something specific on our end so we don't
> identify our pollers as the hub when we're not the designated hub.
> Another solution would be that when the proxy gets a request from a hub, it
> should check whether or not the hub is the designated hub.

If a hub crawls a feed once or a few times a day without actually having
received a ping, I don't think that should pose a problem for a feed proxy
or for the platform hosting the source feed -- it's not very much traffic,
and if the proxy winds up crawling the source feed, it shouldn't cause any
trouble. What's important is that feed proxies don't crawl source feeds
constantly / every time they get a request for the feed, since that hurts
feed serving performance and can DoS the source feed platform.

> - *When hub links appear in the source feed, proxies should subscribe to
> them.*
> While I agree with this, your comment that the proxy may consider the
> fat ping as a light ping is worrisome, as there are cases where the source
> feed isn't updated yet (caches...), and polling it (by not trusting the
> hub) could result in inaccurate content and a delay as long as the proxy
> didn't poll the source feed while it hasn't been updated.
> [This is the behavior that Superfeedr (and I believe other subscribers)
> has had for a long time, which caused us a lot of issues with FeedBurner
> specifically.]

I think this would lead to a problem when either:

- the hub crawls the source feed and gets updates, but the proxy crawls it
  and sees stale content, or
- the platform sends a fat ping to the hub directly, but crawls of the feed
  yield stale content

In an ideal platform, neither of these things should happen -- if caches
are used by the platform hosting a source feed, they should be invalidated
when a new post appears in it and the ping to the hub goes out, so that
subsequent crawls will return the new post. While FeedBurner used to have
some problems in this area, they were fixed several months ago. (Please let
us know if you're still seeing any problems with latency in updates.)

One reason that FeedBurner can't use the fat pings from hubs directly is
that the pings contain only deltas, and don't tell us:

- which items have fallen off the feed, as opposed to items that remain
  unchanged but are still present in the feed
- where items that were added to the feed should be inserted relative to
  existing, unmodified items in the feed

So it's not possible to reconstruct the source feed with the same content
you would get if you crawled it. This may not matter for some applications,
but it does matter to ours, since clients crawling feeds expect the full
feed.

Another reason is that if something goes wrong and we don't receive a ping
from a hub, and then receive another from it, we would miss the new items
or updates sent in the first ping if we only applied the deltas to state we
maintained internally.

Crawling the source feed when we get a fat ping from a hub, caching issues
aside, ensures we get all of the latest content for the source feed. So far
this approach has been working well for us. Are there other situations in
particular that you're concerned about?
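To make the delta-ambiguity point concrete, here's a toy illustration (not real PubSubHubbub code; the item names and the `consistent` helper are invented for this sketch). Given the proxy's last-crawled state plus the entries in a fat ping, two different source feeds are equally consistent with what the proxy knows, so the delta alone can't tell it which one the publisher actually serves:

```python
# Toy illustration of why fat-ping deltas alone can't reconstruct a feed.
cached = ["item-3", "item-2", "item-1"]  # proxy's last crawl of the feed
delta = ["item-4"]                       # entries carried in the fat ping

# Possibility A: item-4 was prepended and nothing fell off the feed.
source_a = ["item-4", "item-3", "item-2", "item-1"]
# Possibility B: item-4 was prepended and item-1 fell off the feed.
source_b = ["item-4", "item-3", "item-2"]

def consistent(source, cached, delta):
    # All the ping tells us: every delta entry is in the source feed, and
    # every source entry is either from the delta or was previously cached.
    return (all(d in source for d in delta) and
            all(e in delta or e in cached for e in source))

print(consistent(source_a, cached, delta))  # True
print(consistent(source_b, cached, delta))  # True -- ambiguous
```

Both possibilities check out, which is why (in our setup) the safe move on a fat ping is to re-crawl the source feed for the authoritative full state.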
> Apart from this, it would be very interesting if proxies, when they
> subscribe to the source feed's designated hub, mentioned the number of
> subscribers that they themselves have for the proxied feed. This way the
> original hub can report that number to the publisher. I understand there
> may be some dupes, but at least there will be data. However, this is
> certainly a secondary issue.

Sounds like an excellent suggestion. Is this something we would embed into
the User-Agent header for the request when making the hub subscription? (I
wasn't aware of a protocol for doing this in the spec.)

> Thanks again for the hard work on this.
>
> Cheers,
>
> Julien
>
> On Wed, Jul 14, 2010 at 10:48 PM, Dan <[email protected]> wrote:
> > Hi all,
> >
> > I wrote up a doc on the Pubsubhubbub wiki describing some problems
> > that arise getting hubs and feed proxies to work together nicely, and
> > an approach for solving those problems:
> >
> > http://code.google.com/p/pubsubhubbub/wiki/HubsAndFeedProxies
> >
> > The approach described is (roughly speaking) the approach taken by
> > FeedBurner and the reference hub.
> >
> > Any comments you could provide would be great. One thing in particular
> > we need to figure out: what convention we should adopt for the HTTP
> > headers used when hubs crawl feeds. Some possible conventions are
> > described in the "Identifying hubs" section... please let me know what
> > you think a good convention would be.
> >
> > Thanks!
> >
> > Dan Rodney
> > FeedBurner
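For what it's worth, here's one way the subscriber-count suggestion above could look if we did embed it in the User-Agent header. The `hub.*` form fields follow the PubSubHubbub subscription request; the User-Agent convention, the product name, and all the example URLs are invented here purely for illustration -- nothing in the spec defines this:

```python
# Hypothetical sketch: a feed proxy advertising its downstream subscriber
# count in the User-Agent when it subscribes at the source feed's hub.
from urllib.parse import urlencode

def build_subscribe_request(hub_url, topic, callback, subscriber_count):
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        # Invented convention: report downstream subscribers in the UA.
        "User-Agent": "ExampleFeedProxy/1.0 (subscribers=%d)" % subscriber_count,
    }
    body = urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic,
        "hub.callback": callback,
        "hub.verify": "async",
    })
    return hub_url, headers, body

url, headers, body = build_subscribe_request(
    "https://hub.example.com/", "https://example.com/feed",
    "https://proxy.example.com/callback", 1234)
print(headers["User-Agent"])  # ExampleFeedProxy/1.0 (subscribers=1234)
```

A hub could then parse the `subscribers=` token out of the User-Agent and add it to the counts it reports to the publisher, with the understood caveat about dupes.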
