On Wed, Jan 14, 2009 at 9:49 PM, Brian Suda <[email protected]> wrote:
> On 1/14/09, André Luís <[email protected]> wrote:
> > I coded a script that looks at a given page and grabs the rel-tags in
> > that page. It then counts the occurrences and orders them in
> > descending order.
> >
> > the script is at http://workshop.andr3.net/tageater/
> >
> > this was meant to infer the user's attention profile from the rel-tags...
> >
> > the problem starts if I follow the rel-* links. For example the
> > website macacos.com marks-up the tagcloud with rel-tags on every page,
> >
> > So, how to detect repetition in these cases?
>
> --- wouldn't you just keep a list of the pages you have already
> crawled? So if you find a tagcloud on page /item1.html and it links to
> /tags/tag1 then on page item2.htm you re-find the tag cloud which
> links to /tags/tag1 you don't follow it again?
>
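To make the repetition question concrete, here is a rough Python sketch, not the actual tageater code. It keeps the list of crawled pages suggested above and, as one possible answer to the repetition problem, counts a tag set only the first time that exact set is seen. Everything named here (`crawl`, `RelLinkParser`, `FOLLOW_RELS`, the `example.test` pages) is made up for illustration, and the "identical set" check is an assumption that only catches exact repeats, such as a site-wide tagcloud with nothing else tagged on the page.

```python
from collections import Counter, deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit

# Hypothetical choice of rel values worth following; rel="tag" is never followed.
FOLLOW_RELS = {"next", "prev", "archives"}


class RelLinkParser(HTMLParser):
    """Collects (rel value, absolute href) pairs from <a> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        url = urldefrag(urljoin(self.base_url, href)).url
        for rel in (attrs.get("rel") or "").lower().split():
            self.links.append((rel, url))


def tag_name(url):
    # rel-tag takes the tag from the last path segment of the href.
    return urlsplit(url).path.rstrip("/").rsplit("/", 1)[-1]


def crawl(start_url, fetch, dedupe_sets=True):
    """Count rel-tags reachable from start_url.

    fetch(url) returns the page's HTML, or None if it is unavailable.
    """
    visited = set()      # pages already crawled, so loops terminate
    seen_sets = set()    # tag sets already counted once
    counts = Counter()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        if html is None:
            continue
        parser = RelLinkParser(url)
        parser.feed(html)
        page_tags = [tag_name(h) for rel, h in parser.links if rel == "tag"]
        fingerprint = frozenset(page_tags)
        if not (dedupe_sets and fingerprint in seen_sets):
            counts.update(page_tags)
        seen_sets.add(fingerprint)
        queue.extend(h for rel, h in parser.links if rel in FOLLOW_RELS)
    return counts, visited


# A made-up three-page site standing in for real HTTP requests.
CLOUD = '<a rel="tag" href="/tags/css">css</a> <a rel="tag" href="/tags/html">html</a>'
PAGES = {
    "http://example.test/item1.html": CLOUD + ' <a rel="next" href="item2.html">next</a>',
    "http://example.test/item2.html": CLOUD
    + ' <a rel="prev" href="item1.html">prev</a> <a rel="next" href="item3.html">next</a>',
    "http://example.test/item3.html": '<a rel="tag" href="/tags/css">css</a> '
    '<a rel="tag" href="/tags/php">php</a> <a rel="prev" href="item2.html">prev</a>',
}

if __name__ == "__main__":
    counts, _ = crawl("http://example.test/item1.html", PAGES.get)
    print(counts.most_common())
```

Passing `dedupe_sets=False` gives the plain occurrence count, which makes it easy to compare how much a repeated tagcloud inflates the numbers.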
Like Toby said in a later reply (which I'll answer after this one, to avoid confusion), I don't follow the tags, but I would follow the rel-[next|prev|archives|...] links, so the same set of tags keeps popping up even when the URL changes. (And yes, you should keep a bucket of crawled links to avoid infinite loops.) If you keep getting the same set of tags, each tag's number of occurrences just keeps increasing, so the weight loses its meaning.

However, from my little testing and a later interview with the site owners, I think the weight of each tag is relative, since pretty much all of the tags are meaningful to the owner of the website. You just can't say that X > Y, but you can say that the owner of that site is at least interested in X and Y.

Unless you see some holes in my logic. ;)

> > So what you're saying is that this falls out of the spec's scope,
> > right? It should be the parsers adapting their behaviour depending on
> > their goal?
>
> --- probably outside of the spec, but certainly a best-practices
> should cover these sorts of issues.

Agreed.

> > You're right. Do you have a link where I can read more about that
> > discussion? Thanks.
>
> There was discussion about canonical hCards 2 years ago
> http://microformats.org/discuss/mail/microformats-discuss/2007-January/008265.html
>
> I am not sure how helpful any of that discussion was/is to this problem.

Alright, I'll have a look. And on the wiki as well. I think tags are a whole different matter though, because they're based on a single element (just like XFN and other rel-based microformats).

-- 
André Luís
_______________________________________________
microformats-new mailing list
[email protected]
http://microformats.org/mailman/listinfo/microformats-new
