Thanks for the insight, Brian. Comments inline.

On Wed, Jan 14, 2009 at 2:10 PM, Brian Suda <[email protected]> wrote:
> On 1/14/09, André Luís <[email protected]> wrote:
>> Do you have any suggestions on how to deal with repetitions? I've
>> tried parsing several pages of several websites and some of them used
>> rel-tags on tagclouds... these would be present on every page (sidebar
>> of blog) thus rendering the data kinda useless.
>
> --- do you have a real world example of where this would be a problem?
> The old technorati kitchen crawled the web and allowed you to search
> it. Having repetitions actually allowed for a nice merging of the
> data.

Right, in certain contexts it makes sense to merge data and end up
with a more meaningful set of instances (of events, vcards, etc.), but
in others, not quite.

I'll give an example. I coded a script that looks at a given page and
grabs the rel-tags in that page. It then counts the occurrences and
sorts them in descending order. The script is at
http://workshop.andr3.net/tageater/

This was meant to infer the user's attention profile from the
rel-tags... The problem starts if I follow the rel-* links. For
example, the website macacos.com marks up the tagcloud with rel-tags on
every page, so if I follow the rel-archives links I'll end up getting
the tagcloud on every one of them...

Have a look at
http://workshop.andr3.net/tageater/?url=http%3A%2F%2Fmacacos.com

I'm not following the links here because I got stuck on this question,
so I just print a link to them.

Using rel-tags in tagclouds might be discouraged, but the fact is that
it happens quite a bit in the wild. I saved a static HTML page of the
scraping I did back then of all the barcamp attendees' webpages. You
can have a look here:
http://workshop.andr3.net/tageater/examples/barcamp.html

For instance, these are a few that use rel-tags on tagclouds:
- http://macacos.com/
- http://www.devile.net/
- http://blog.pfragoso.org/
- http://www.brunoamaral.com/
- ...

So, how to detect repetition in these cases?
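
For reference, the counting step described above (grab the rel-tags on
a page, count them, sort descending) can be sketched roughly as
follows. This is not the tageater source, just a minimal Python sketch
using the standard library. The names RelTagParser and count_rel_tags
are hypothetical, and lowercasing the tag names before counting is an
assumption of this sketch, not something rel-tag requires.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import unquote, urlsplit


class RelTagParser(HTMLParser):
    """Collects the tag name of every <a rel="tag" href="..."> link."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # rel may hold several space-separated values, e.g. "nofollow tag".
        rel_values = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if "tag" in rel_values and href:
            # In rel-tag, the tag is the last path segment of the href.
            path = urlsplit(href).path.rstrip("/")
            name = unquote(path.rsplit("/", 1)[-1]).replace("+", " ")
            if name:
                # Assumption: treat "CSS" and "css" as the same tag.
                self.tags.append(name.lower())


def count_rel_tags(html):
    """Return (tag, count) pairs for one page, most frequent first."""
    parser = RelTagParser()
    parser.feed(html)
    parser.close()
    return Counter(parser.tags).most_common()
```

Run on a single page this gives a plausible profile. Run on every page
reached through the rel-* links, a sidebar tagcloud adds one to each of
its tags per page, which is exactly the repetition problem above.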

>
>> Should/can we create guidelines for producers AND parsers alike on how
>> to deal with this? Like adding site-wide unique id's to the root
>> elements? Or is this out of the scope of microformats altogether?
>
> --- again, this would depend on the format in question. The existence
> of multiple events with the same timestamp and name could be used to
> merge data, UIDs and URLs could be as well, but everything could be
> gamed.

So what you're saying is that this falls outside the spec's scope,
right? It should be the parsers adapting their behaviour depending on
their goal?

>
> But this isn't unique to microformats, other semantic technologies
> would have this issue as well. There was talk of a rel-canonical
> awhile ago, but it wasn't big enough a problem to pursue.

You're right. Do you have a link where I can read more about that
discussion? Thanks.

>
> If you have an example we could work through it.
>
> -brian
>

cheers,
--
André Luís
_______________________________________________
microformats-new mailing list
[email protected]
http://microformats.org/mailman/listinfo/microformats-new
