As many are aware, Tim Bray[1] and others have recently been
railing on the subject of duplicate entries being delivered by aggregation
services and/or search engines like PubSub, Technorati, Feedster, etc. As
unexpected as this may sound, I am quite confident that when Atom V1.0 is
released and starts to be used, the problem of duplicates will get worse --
unless something (probably quite simple) is done. A major part of the
problem with duplicate entries is that people publish duplicate feeds. The
introduction of Atom V1.0 will result in even more duplicate feeds being
produced... This is not because of "bugs" in the producers of synthetic
feeds.

        A major cause of duplicates in at least some of the existing
services is the fact that bloggers insist on engaging in the apparently
illogical and wasteful practice of publishing multiple versions of their feeds
and thus duplicates of their entries or items. 
        Often blogs will offer at least one flavor of RSS as well as Atom.
Those that don't offer Atom will often offer at least two flavors of RSS.
Some sites provide as many as six or seven distinct versions of their feeds
in addition to a sometimes large number of "category" feeds (each of which
is offered in multiple encodings) that contain nothing but duplicates of
entries/items found in some primary feed. Unfortunately, while RSS and Atom
both contain mechanisms to identify their "HTML" alternates, there isn't any
clear mechanism available to discover alternate feeds. In theory, if such
means existed, automated processors of large numbers of feeds could
selectively ignore duplicate feeds...
        The process of recognizing duplicates within a single feed is
relatively straightforward and is widely implemented not only by news
aggregator clients but also by the aggregating services. (I believe the most
common method is to generate an MD5 hash for each entry/item and then do the
obvious processing.) Problems arise, however, when one attempts to recognize
duplicates across feeds which may have different information-carrying
capacities (i.e. RSS, RDF and Atom are different). Recognizing duplicate
feeds is made particularly difficult if only because of the practice that
many sites have of producing "category" feeds that are organized to present
different selections of the posts for a single blog. The fact that an entry
has been published in one or another "category feed" is, in itself,
information that may be useful to some readers. In any case, there is no way
without using human, meat-machine based inspection to determine which of
many feeds is the "primary" feed...
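        For illustration, here is a minimal sketch of the within-feed
hashing approach mentioned above. The choice of which fields to hash is my
assumption -- actual aggregators vary in what they fingerprint:

```python
import hashlib

def entry_fingerprint(entry):
    """Hash the fields that identify an entry/item.

    Which fields to include is an assumption; real implementations
    differ, and the choice affects what counts as a "duplicate".
    """
    key = "\x00".join([
        entry.get("title", ""),
        entry.get("link", ""),
        entry.get("content", ""),
    ])
    return hashlib.md5(key.encode("utf-8")).hexdigest()

def deduplicate(entries, seen=None):
    """Yield only entries whose fingerprint has not been seen before."""
    seen = set() if seen is None else seen
    for entry in entries:
        fp = entry_fingerprint(entry)
        if fp not in seen:
            seen.add(fp)
            yield entry
```

Note that passing the same `seen` set across feeds is exactly where this
breaks down: feeds in different formats serialize the "same" entry
differently, so the hashes no longer match.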
        It should also be noted that if an aggregation service (such as
PubSub) allows users to subscribe to content based on the "source feed" of
the entry, the aggregation service is forced to read all the feeds that it
can find -- since any findable feed is something that some user might want
to subscribe to... This issue becomes particularly troubling in contexts like
FeedMesh where aggregators are expected to share their knowledge of feed
updates with each other. Issues might arise if some aggregators are
selectively ignoring feeds or processing the feeds but not publishing their
results to the FeedMesh.
        Even if the aggregator could establish that two feeds carried
identical content, it would be difficult to stop processing one of the feeds
since someone might wish to subscribe to it... Admittedly, one can imagine
that an aggregator might install a rule that says: "Any subscription to feed
A should be read as a subscription to feed B." However, that would require
the aggregator to assume that simply because two feeds had substantially
identical content in the past, that they would have identical content in the
future. This is not an easy assumption to make without some explicit
indication of feed equivalence from the producer of the feed. We could go
further and require that the aggregator constantly "check" its assumptions
and if the two feeds diverge then uninstall the rule. This is getting
complex... Too complex.
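        To make the complexity concrete, here is a hypothetical sketch (all
names invented) of such an alias rule together with the constant re-checking
it would require:

```python
class AliasTable:
    """Maps a duplicate feed URL to its presumed-equivalent preferred feed."""

    def __init__(self):
        self.alias = {}  # duplicate feed URL -> preferred feed URL

    def install(self, dup_url, preferred_url):
        # "Any subscription to feed A should be read as a subscription to B."
        self.alias[dup_url] = preferred_url

    def resolve(self, url):
        # Subscriptions to an aliased feed are silently redirected.
        return self.alias.get(url, url)

    def recheck(self, dup_url, dup_fingerprints, preferred_fingerprints):
        # The aggregator must keep verifying its equivalence assumption:
        # if the two feeds ever diverge, uninstall the rule and go back
        # to reading both feeds.
        if set(dup_fingerprints) != set(preferred_fingerprints):
            self.alias.pop(dup_url, None)
```

Note that `recheck` only works if the aggregator keeps polling the feed it
claims to be ignoring -- which defeats much of the point.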
        Tim suggests that aggregators should be able to rely simply on
atom:id to detect duplicates. However, as has often been pointed out,
applying this rule in an intermediary like PubSub would simply make PubSub a
marvelously efficient tool for denial of service attacks. I.e. if I didn't
like something you published, I would simply publish something in my blog
that had the same atom:id as something you had published. PubSub and other
synthetic feed producers would then flush your post from the system and
replace it with my post... Not good -- and not avoidable given the current
loose rules for defining instances of atom:id.
        This issue of duplicate detection has come up many times in this
forum and has never been dealt with seriously as far as I can tell. The
current Atom draft continues to fail to address the issue. Simply
complaining isn't getting us anywhere even as the introduction of Atom V1.0
makes the problem worse.
        One thing that we could do in the short term to reduce the number of
duplicate feeds is to define metadata that would allow feed publishers to
describe the relationships between feeds and describe the content of
different feeds. If, when reading a feed, I could be informed that this feed
was a duplicate of another "preferred" feed, I could switch to the preferred
feed and stop reading the duplicate. I could use this knowledge to map
subscriptions to the duplicate feeds to their equivalents. This would be
wonderful in the transition from Atom 0.3 to Atom V1.0... (i.e. if the old
format feed could point to the newer version, we'd all eventually stop
reading the old format feed.)
        For instance:

<link rel="preferred-feed" type="application/atom+xml"
href="http://example.org/feed.xml"/>
<link rel="subset-of" type="application/atom+xml"
href="http://example.org/fullfeed.xml"/>
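
        A consumer of these rel values (which are a proposal here, not an
existing standard) might discover the preferred feed with something like the
following sketch, assuming the http://www.w3.org/2005/Atom namespace:

```python
import xml.etree.ElementTree as ET

# Assumed namespace; older Atom drafts used different namespace URIs.
ATOM_NS = "{http://www.w3.org/2005/Atom}"

def preferred_feed_url(feed_xml):
    """Return the href of a rel="preferred-feed" link, if the feed
    declares one, else None."""
    root = ET.fromstring(feed_xml)
    for link in root.iter(ATOM_NS + "link"):
        if link.get("rel") == "preferred-feed":
            return link.get("href")
    return None
```

An aggregator that found a non-None result could remap subscriptions to the
duplicate feed onto the preferred one and stop polling the duplicate.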

        If these link rel types are acceptable to people, we should probably
define equivalent syntax to be inserted into old Atom feeds to allow the
transition to the new Atom as well as define extensions to the various
flavors of RSS. If we can implement this minor extension, we may see many
fewer duplicate feeds being consumed and thus fewer duplicate entries.
        Comments?

        bob wyman

[1] http://www.tbray.org/ongoing/When/200x/2005/04/03/Atom-Now
