Re: PaceDuplicateIDWithSource2

Thomas Broyer Tue, 03 May 2005 15:51:05 -0700

Martin Duerst wrote:


I'm not really happy with this. Conceptually, it seems to replace
an ID for an entry with a pair (ID,feed). As IDs are URIs/IRIs
(remember what they expand to), that doesn't make sense.
What guarantee do we have that two feeds will be different?
(yes, these days, we get feeds over http, but there are other
ways to get feeds, where things may not be that easy).

If we don't have a solution for the malicious case, and we
think we need one, we should work on some security stuff.

If we think that accidental ID duplication is a problem,
then let's look at how we can improve the explanation.
After that, there may still be an occasional accident,
but the spec should be worded to catch that, not to
provide a loophole.

If we have to allow duplicate IDs, I'd rather prefer we
do it without all this feed/source/... stuff: I.e. if
you are an aggregator and can't manage to do duplicate
elimination, you can just delegate the problem to the
next place in the feeding chain.


OK, so, something like this (conceptually).

First, start with the "originating publisher":
 1. *You* MUST NOT use an atom:id for different entries (entries as
    resources, not as representations) in any of *your* feeds.
 2. *You* MUST NOT include the same entry (resource, not representation)
    more than once in a feed.

-> atom:id should be reworded to clarify that it identifies an "entry resource", not each of its representations.

Next, what about aggregation? First thoughts: as atom:id values are not guaranteed to be "globally unique", aggregated feeds might contain different entries (entry resources) with the same atom:id. As they are aggregated entries, they should (I'd also wish it were a MUST) have an atom:source. So we can say that a feed is allowed to include different entries sharing the same atom:id as long as at most one of them doesn't have an atom:source. This is also a way to enforce the second item of the above list in the "publishers' camp": an entry a publisher puts in its own feed has no atom:source, so two such entries with the same atom:id would break the rule.

-> please, change the SHOULD on atom:source to a MUST
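
To make the "at most one without an atom:source" rule concrete, here is a minimal sketch of how a consumer might check a feed against it. The representation is my own assumption, not anything from the spec: each entry is reduced to a hypothetical (atom_id, source_id) pair, with source_id set to None when the entry carries no atom:source.

```python
from collections import Counter


def ids_violating_source_rule(entries):
    """Return the atom:id values that break the proposed rule.

    `entries` is a list of (atom_id, source_id) pairs; this pair shape is
    a hypothetical simplification. source_id is None when the entry has
    no atom:source. The rule: among entries sharing an atom:id, at most
    one may lack an atom:source.
    """
    # Count, per atom:id, how many entries have no atom:source.
    sourceless = Counter(
        atom_id for atom_id, source_id in entries if source_id is None
    )
    return sorted(atom_id for atom_id, count in sourceless.items() if count > 1)
```

For example, a feed holding one native entry and one aggregated entry with the same atom:id passes, while a feed holding two native entries with the same atom:id is reported.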

One step further: aggregation of aggregated feeds. We haven't handled the case where a single entry appears more than once in an aggregated feed because it arrived through different aggregation routes. There, we need to identify and distinguish feeds from each other, so yes, we're moving to an identifier pair (feed+entry), and we have to look at the atom:source and its content. However, this doesn't solve the problem as, IMO, we shouldn't require an entry to have a single source feed. Publishers should be allowed to publish an entry in several feeds (main + category) without the atom:source hassle. Also, atom:id is not required metadata for a feed. Suggestion: "delegate the problem to the next place in the feeding chain". This pushes the delegation all the way to the end user's aggregator, where the user might have settings on how to treat entries with the same atom:id but different feed atom:id values.
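
As an illustration of such an end-user setting, here is a rough sketch. Everything in it is hypothetical: entries are reduced to (feed_id, entry_id) pairs, feed_id is None when the source feed has no atom:id, and the setting name and its two values are made up for the example.

```python
def dedupe(entries, same_id_across_feeds="keep"):
    """Drop duplicate entries, keeping the first occurrence of each.

    `entries` is a list of (feed_id, entry_id) pairs (hypothetical shape).
    With same_id_across_feeds="keep", the identity is the (feed, entry)
    pair, so the same atom:id coming from different feeds is kept.
    With "merge", the identity is the entry atom:id alone.
    """
    if same_id_across_feeds not in ("keep", "merge"):
        raise ValueError("same_id_across_feeds must be 'keep' or 'merge'")
    seen = set()
    result = []
    for feed_id, entry_id in entries:
        key = (feed_id, entry_id) if same_id_across_feeds == "keep" else entry_id
        if key in seen:
            continue  # already seen under the chosen notion of identity
        seen.add(key)
        result.append((feed_id, entry_id))
    return result
```

Note that with "keep", entries whose source feed has no atom:id all share the feed_id None and are therefore compared on their entry atom:id only, which is one possible answer to the "atom:id is not required on feeds" problem, not the only one.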

Finally, what about atom:id at the feed level? IMO, different feed "flavors" (full-content, summary-only, title-only) should use the same atom:id as a means to say "I guarantee that these are the same (e.g. they are updated simultaneously and one is not a subset/superset of another), with different levels of richness". It should be the same for last-10-entries vs. last-50-entries feeds: as long as their N most recent entries (based on their atom:updated), N being the length of the shorter feed (here 10), are the same (based on their atom:id), it'd then be a means to say "I guarantee that these are the same (e.g. they are updated simultaneously), with different history lengths".
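
The last-10 vs. last-50 condition can be sketched as follows. This is only an illustration under my own assumptions: each feed is a list of hypothetical (atom_id, updated) pairs, and the atom:updated values are strings in one common format and time zone so that they sort chronologically as plain strings.

```python
def same_recent_entries(feed_a, feed_b):
    """Check that two feeds agree on their N most recent entries.

    Each feed is a list of (atom_id, updated) pairs (hypothetical shape).
    N is the length of the shorter feed. `updated` values are assumed to
    be timestamps in one common format and time zone, so that string
    comparison matches chronological order.
    """
    n = min(len(feed_a), len(feed_b))

    def recent_ids(feed):
        # Most recent first; ties on `updated` are broken by atom_id so
        # that both feeds are ordered the same way.
        ordered = sorted(feed, key=lambda e: (e[1], e[0]), reverse=True)
        return [atom_id for atom_id, _ in ordered[:n]]

    return recent_ids(feed_a) == recent_ids(feed_b)
```

A short feed and a long feed that pass this check could then reasonably share a feed-level atom:id; two feeds that fail it should not.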

Oh, and last but not least: these considerations on "entry identification" should IMO appear in an informative appendix. The normative part of the spec would just say a feed SHOULD NOT include more than one entry with a given atom:id and then point to the appendix for more about some ways of identifying entry duplicates.

--
Thomas Broyer
