Re: PaceAllowDuplicateIDs

David M Johnson Thu, 05 May 2005 08:33:17 -0700


I'm -1 on PaceAllowDuplicateIDs

Reasons:

1) We're supposed to be standardizing current practice, not inventing new things. Current best practice is to have unique IDs, and current software (e.g. Javablogs.com) is predicated on this practice. I know this practice is not followed widely enough, but that is another matter.

2) I think it is *much* more useful to think of an Atom Entry as an event that occurred at a specific time. Typically, an event is the publication of an article or blog entry on the web. For example:

   event: CNET published article
   subject: CNET
   object: article

But an Atom Entry could also represent other events:

   event: delivery van delivers package
   subject: delivery van
   object: package

   event: alarm system sends warning
   subject: alarm system
   object: warning

   event: server sends load warning
   subject: server
   object: load warning

If you think of Atom Entries as events, then it makes sense to consider the Atom Entry ID to be the ID of the event, not the ID of the subject or object of the event. Events are unique (you can't have more than one version of an event) and can be assigned GUIDs and therefore you cannot have more than one entry with the same ID.

In the case of earthquake data, each new data report is a new event.

   event: agency reports earthquake data
   subject: agency
   object: earthquake data

The ID is the ID of the "data reported" event not the ID of the earthquake.

We don't know what subjects and objects people are going to use in the future, so we can't specify Atom elements or IDs for subjects and objects -- that's what extensions are for. If you want to create a feed to syndicate information about earthquakes, then you introduce an extension for uniquely identifying earthquakes. The same goes for earthquakes.
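
A minimal sketch of the extension approach described above, with several assumptions: the `q:quakeId` element and its namespace are hypothetical, the IDs and report contents are made up for illustration, and the Atom namespace used is the one from the final RFC 4287 rather than the format-08 draft. Every entry keeps its own unique atom:id (the ID of the "data reported" event), while the extension element says which earthquake the report is about.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

ATOM = "http://www.w3.org/2005/Atom"      # assumption: final RFC 4287 namespace
QUAKE = "http://example.org/ns/quake"     # hypothetical extension namespace

# Illustrative feed: three report events, two of them about the same earthquake.
FEED = f"""<feed xmlns="{ATOM}" xmlns:q="{QUAKE}">
  <title>Earthquake reports</title>
  <entry>
    <id>urn:example:report:1</id>
    <updated>2005-05-03T10:00:00-05:00</updated>
    <q:quakeId>urn:example:quake:A</q:quakeId>
    <content>Magnitude 5.1 (preliminary)</content>
  </entry>
  <entry>
    <id>urn:example:report:2</id>
    <updated>2005-05-03T11:00:00-05:00</updated>
    <q:quakeId>urn:example:quake:A</q:quakeId>
    <content>Magnitude 5.3 (revised)</content>
  </entry>
  <entry>
    <id>urn:example:report:3</id>
    <updated>2005-05-03T12:00:00-05:00</updated>
    <q:quakeId>urn:example:quake:B</q:quakeId>
    <content>Magnitude 4.0 (preliminary)</content>
  </entry>
</feed>"""


def reports_by_quake(feed_xml):
    """Map each earthquake ID (extension element) to the entry IDs reporting on it."""
    root = ET.fromstring(feed_xml)
    groups = defaultdict(list)
    for entry in root.findall(f"{{{ATOM}}}entry"):
        entry_id = entry.findtext(f"{{{ATOM}}}id")
        quake_id = entry.findtext(f"{{{QUAKE}}}quakeId")
        groups[quake_id].append(entry_id)
    return dict(groups)


for quake, reports in reports_by_quake(FEED).items():
    print(quake, reports)
```

A consumer that only wants the latest word on each earthquake groups by the extension element; one that does not understand the extension still sees three ordinary entries with distinct IDs.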

- Dave

On May 5, 2005, at 12:02 AM, Tim Bray wrote:

<co-chair-hat status="OFF">
http://www.intertwingly.net/wiki/pie/PaceAllowDuplicateIDs
This Pace was motivated by a talk I had with Bob Wyman today about the problems the synthofeed-generator community has.

Summary:
1. There are multiple plausible use-cases for feeds with duplicate IDs
2. Pro and Contra
3. Alternate Paces
4. Details about this Pace

1. Use-Cases

Here's a stream of stock-market quotes.
<feed><title>My Portfolio</title>
 ....
 <entry><title>MSFT</title>
  <updated>2005-05-03T10:00:00-05:00</updated>
  <content>Bid: 25.20 Ask: 25.50 Last: 25.20</content>
  </entry>
 <entry><title>MSFT</title>
  <updated>2005-05-03T11:00:00-05:00</updated>
  <content>Bid: 25.15 Ask: 25.25 Last: 25.20</content>
  </entry>
 <entry><title>MSFT</title>
  <updated>2005-05-03T12:00:00-05:00</updated>
  <content>Bid: 25.10 Ask: 25.15 Last: 25.10</content>
  </entry>
</feed>
You could also imagine a stream of weather readings. Bob's actual here-and-now use-case from PubSub is earthquakes: an entry describes an earthquake, and they keep re-issuing it as new info about strength/location comes in.

Some people only care about the most recent version of the entry, others might want to see all of them. Basically, each atom:entry element describes the same Entry, only at a different point in time.

You could argue that in some cases, these are representations of the Web resources identified by the atom:id URI, but I don't think we need to say that explicitly.

Yes, you could think of alternate ways of representing stock quotes or any of the other use-cases, but this is simple, direct, and idiomatic.

2. Pro and Contra

Given that I issued the consensus call rejecting the last attempt to do this, which was PaceRepeatIdInDocument, I felt nervous about revisiting the issue. So I went and reviewed the discussion around that one, which I extracted and placed at http://www.tbray.org/tmp/RepeatID.txt for the WG's convenience.

Reviewing that discussion, I'm actually not impressed. There were a few -1's but very few actual technical arguments about why this shouldn't be done. The most common was "Software will screw this up". On reflection, I don't believe that. You have a bunch of Entries, some of them have the same ID and are distinguished by datestamp. Some software will show the latest, some will show all of them, the good software will allow switching back and forth. Doesn't seem like rocket science to me.
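
A minimal sketch of the "show the latest or show all" behavior described above. The assumptions: the shared atom:id value `urn:example:quote:MSFT` is made up (the stock-quote example in this message shows no IDs), the namespace is the one from the final RFC 4287, and `parse_entries` and `latest_only` are hypothetical helper names, not part of any existing library.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

ATOM = "http://www.w3.org/2005/Atom"   # assumption: final RFC 4287 namespace
NS = {"a": ATOM}

# The stock-quote stream, with a made-up atom:id shared by all three entries.
FEED = f"""<feed xmlns="{ATOM}">
  <title>My Portfolio</title>
  <entry><title>MSFT</title>
    <id>urn:example:quote:MSFT</id>
    <updated>2005-05-03T10:00:00-05:00</updated>
    <content>Bid: 25.20 Ask: 25.50 Last: 25.20</content>
  </entry>
  <entry><title>MSFT</title>
    <id>urn:example:quote:MSFT</id>
    <updated>2005-05-03T11:00:00-05:00</updated>
    <content>Bid: 25.15 Ask: 25.25 Last: 25.20</content>
  </entry>
  <entry><title>MSFT</title>
    <id>urn:example:quote:MSFT</id>
    <updated>2005-05-03T12:00:00-05:00</updated>
    <content>Bid: 25.10 Ask: 25.15 Last: 25.10</content>
  </entry>
</feed>"""


def parse_entries(feed_xml):
    """Return every entry in feed order as a dict with id, updated, and content."""
    root = ET.fromstring(feed_xml)
    entries = []
    for e in root.findall("a:entry", NS):
        entries.append({
            "id": e.findtext("a:id", namespaces=NS),
            "updated": datetime.fromisoformat(e.findtext("a:updated", namespaces=NS)),
            "content": e.findtext("a:content", namespaces=NS),
        })
    return entries


def latest_only(entries):
    """Keep, for each atom:id, only the entry with the newest atom:updated."""
    latest = {}
    for entry in entries:
        current = latest.get(entry["id"])
        if current is None or entry["updated"] > current["updated"]:
            latest[entry["id"]] = entry
    return list(latest.values())


all_entries = parse_entries(FEED)        # "show all" view
newest = latest_only(all_entries)        # "show latest" view
print(len(all_entries), "entries in total;", len(newest), "after collapsing by ID")
```

Switching back and forth between the two views is then a matter of choosing which list to render.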

So here's how I see it: there are plausible use cases for doing this, and one of the leading really large-scale implementors in the space (PubSub) wants to do this right now. Bob's been making strong claims about not being able to use Atom if this restriction remains in place.

I believe strongly that if there's something that implementors want to do, standards shouldn't get in the way unless there's real interoperability damage. I'm certainly prepared to believe that this could cause interoperability damage, but to date I haven't seen any convincing arguments that it will. I think that if we nonetheless forbid it, people who want to do this will (a) use RSS instead of Atom, (b) cook up horrible kludges, or (c) ignore us and just do it.

So my best estimate is that the cost of allowing dupes is probably much lower than the cost of forbidding them.

Finally, our charter does say that we're also supposed to specify how you'd go about archiving feeds, and AllowDuplicateIDs makes this trivial. I looked around and failed to find how we claimed we were going to do that while still forbidding duplicates, but it's possible I missed that.

3. Alternate Paces

I didn't want to just revive PaceRepeatIdInDocument, because it used the word "version" in what I thought was kind of a sloppy way, and because it wasn't current against format-08. I don't like either PaceDuplicateIDWithSource or ...WithSource2; they are complicated and don't really meet PubSub's needs anyhow. So I'm strongly -1 on both of those. Yes, that means that if this Pace fails, we'll allow no duplicates at all. I prefer either "dupes OK" or "no dupes" to "dupes OK in the following circumstances"; it's cleaner.

4. Details

Section 4.1.2 of format-08 says that atom:entry "represents an individual entry". The Pace says that if you have dupes, they "represent the same entry", which I think is consistent with both the letter and spirit of 4.1.2.

The Pace discourages duplicate timestamps without resorting to MUST language, because accidents can happen; this allows software to throw such entries on the floor while positively encouraging noisy complaining. On the other hand, if the WG wanted either to insist on a MUST here or remove the discouragement altogether I could live with that.
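
One way software might handle entries that share both atom:id and timestamp, sketched under assumptions: the function name `drop_timestamp_collisions` is hypothetical, the IDs are made up, and "throw on the floor" is interpreted here as keeping the first such entry and discarding later ones. Timestamps are compared as instants, so the same moment written with a different UTC offset still counts as a collision.

```python
from datetime import datetime


def drop_timestamp_collisions(entries):
    """entries: iterable of (atom_id, updated_iso_string) pairs, in feed order.

    Returns (kept, complaints): the pairs that survive, and one warning
    message for each entry dropped because its ID and instant were already seen.
    """
    seen = set()
    kept = []
    complaints = []
    for atom_id, updated in entries:
        # Timezone-aware datetimes compare and hash by instant, not by spelling.
        key = (atom_id, datetime.fromisoformat(updated))
        if key in seen:
            complaints.append(
                f"duplicate atom:id {atom_id!r} repeats atom:updated {updated}"
            )
            continue
        seen.add(key)
        kept.append((atom_id, updated))
    return kept, complaints


entries = [
    ("urn:example:quote:MSFT", "2005-05-03T10:00:00-05:00"),
    ("urn:example:quote:MSFT", "2005-05-03T11:00:00-05:00"),
    ("urn:example:quote:MSFT", "2005-05-03T15:00:00+00:00"),  # same instant as the first
]
kept, complaints = drop_timestamp_collisions(entries)
for message in complaints:
    print("WARNING:", message)
```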

Finally, it makes it clear that if there are entries with duplicate atom:id, software is free to display all or a subset, and calls out the likely common case where you discard all but the most recent. If I were Brent Simmons or equivalent, I'd be coding up a button where you can arrange to show them all or just one.
</co-chair-hat>
