Re: [uf-discuss] stats on well formed XHTML

2008-01-18 Thread Kevin Marks
What I found with the Technorati crawler was that the Atom timestamps were more reliable than RSS, as RSS timezones were underspecified. Talking of hAtom, here's a tool that uses it: http://googlenotebookblog.blogspot.com/2008/01/permalinks-up-and-hatom.html Niall told me last night that he's
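
To illustrate the timestamp point in the message above, here is a minimal Python sketch. It assumes Atom dates follow RFC 3339 and RSS dates follow the RFC 822 style. The helper names and the sample values are hypothetical, not taken from the thread or from any crawler. It only shows that an RSS-style date can parse successfully and still carry no usable UTC offset.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime


def parse_atom_timestamp(value: str) -> datetime:
    """Parse an RFC 3339 timestamp, the format Atom uses for dates."""
    # Older versions of datetime.fromisoformat() reject a trailing "Z",
    # so rewrite it as an explicit zero offset first.
    if value.endswith(("Z", "z")):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def parse_rss_timestamp(value: str) -> datetime:
    """Parse an RFC 822 style date, the format RSS uses for dates."""
    # Returns a naive datetime when the zone is missing or unrecognised.
    return parsedate_to_datetime(value)


def has_known_offset(stamp: datetime) -> bool:
    """True if the parsed timestamp carries a usable UTC offset."""
    return stamp.utcoffset() is not None


if __name__ == "__main__":
    # Hypothetical sample values.
    print(has_known_offset(parse_atom_timestamp("2008-01-18T08:06:00Z")))
    print(has_known_offset(parse_rss_timestamp("Fri, 18 Jan 2008 00:06:00 -0800")))
    print(has_known_offset(parse_rss_timestamp("Fri, 18 Jan 2008 00:06:00")))
```

A crawler could treat timestamps without a known offset as lower-confidence rather than guessing a zone, though that policy is a design choice and not something stated in the thread.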

Re: [uf-discuss] stats on well formed XHTML

2008-01-18 Thread Derrick Lyndon Pallas
Rob Crowther wrote: On 17/01/2008, Derrick Lyndon Pallas [EMAIL PROTECTED] wrote: Not so. The Internet Archive knows the first time they've seen an URL, over the past ten years; they can also tell you when the content has significantly changed. But can it tell you whether it's a new

Re: [uf-discuss] stats on well formed XHTML

2008-01-18 Thread Kevin Burton
On Jan 18, 2008 12:06 AM, Kevin Marks [EMAIL PROTECTED] wrote: What I found with the Technorati crawler was that the Atom timestamps were more reliable than RSS, as RSS timezones were underspecified. Talking of hAtom, here's a tool that uses it:

Re: [uf-discuss] stats on well formed XHTML

2008-01-17 Thread Nick Fitzsimons
On Wed, January 16, 2008 11:04 pm, ryan wrote: On Jan 16, 2008, at 12:41 AM, Kevin Burton wrote: Has anyone done any large scale audits of XHTML in the wild to determine the percentage that parse correctly? Yes, Ian Hickson at Google did a survey of about 1B pages and found that over 90%

Re: [uf-discuss] stats on well formed XHTML

2008-01-17 Thread Karl Dubost
On 17 Jan 2008, at 19:22, Nick Fitzsimons wrote: I can't imagine that things have got any better since :-( to really evaluate this, there are two parameters to take into account. number of XHTML pages - [now] number of total pages but in my humble opinion, more interesting

Re: [uf-discuss] stats on well formed XHTML

2008-01-17 Thread Derrick Lyndon Pallas
Karl Dubost wrote: to really evaluate this, there are two parameters to take into account. number of XHTML pages - [now] number of total pages I can probably tell you both of those numbers for the last couple of months. Knowing how many pages are malformed might take a bit longer.

Re: [uf-discuss] stats on well formed XHTML

2008-01-17 Thread Geoffrey Sneddon
On 17 Jan 2008, at 01:44, Kevin Burton wrote: Specifically, the probability that a naive non-XML parser can make while indexing the content. I'm not sure what you mean here, but I'd recommend against using an XML parser against web content and instead use something like the HTML5 parsing

Re: [uf-discuss] stats on well formed XHTML

2008-01-17 Thread Rob Crowther
On 17/01/2008, Derrick Lyndon Pallas [EMAIL PROTECTED] wrote: Not so. The Internet Archive knows the first time they've seen an URL, over the past ten years; they can also tell you when the content has significantly changed. But can it tell you whether it's a new page or an old page at a new

Re: [uf-discuss] stats on well formed XHTML

2008-01-17 Thread Kevin Burton
but in my humble opinion, more interesting would be to have this ratio for each year with *only the new pages* created during the year. Unfortunately because there is no uniform way to sign the date of pages, and because HTTP is in even worse shape than HTML, it is almost impossible to

Re: [uf-discuss] stats on well formed XHTML

2008-01-17 Thread Kevin Burton
Not so. The Internet Archive knows the first time they've seen an URL, over the past ten years; they can also tell you when the content has significantly changed. Obviously, there is a bias towards pages (and sites) with higher traffic, but that seems reasonable if you're evaluating standard

Re: [uf-discuss] stats on well formed XHTML

2008-01-17 Thread Karl Dubost
On 18 Jan 2008, at 09:03, Kevin Burton wrote: One could perform such an audit with hAtom published values. Either that or use the RSS timestamp or timestamp in the URL. hmm maybe an intermediate possibility, Timestamp of domain creation. whois microformats.org Created On:26-Jan-2005

[uf-discuss] stats on well formed XHTML

2008-01-16 Thread Kevin Burton
Has anyone done any large scale audits of XHTML in the wild to determine the percentage that parse correctly? I'm thinking about deploying one in Spinn3r but I'd rather focus on other tasks if this has already been done. I'm curious about the assumptions one could make when assuming that XHTML
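
As a rough illustration of what such an audit measures, here is a minimal Python sketch that checks whether markup is well-formed XML and reports the fraction of a sample that is. The function names and sample strings are hypothetical, and this is not how Spinn3r or any survey mentioned in this thread works. One known limitation: the standard-library parser does not load the XHTML DTD, so a page using a named entity such as `&nbsp;` is counted as a failure here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # Unclosed tags, mismatched tags, undefined entities, and so on.
        return False
    return True


def well_formed_ratio(pages) -> float:
    """Fraction of pages in the sample that parse as well-formed XML."""
    pages = list(pages)
    if not pages:
        return 0.0
    return sum(is_well_formed(page) for page in pages) / len(pages)


if __name__ == "__main__":
    # Hypothetical sample; a real audit would read fetched documents.
    sample = [
        "<html xmlns='http://www.w3.org/1999/xhtml'><body><p>ok</p></body></html>",
        "<html><body><p>unclosed</body></html>",
    ]
    print(well_formed_ratio(sample))
```

A strict check like this only answers the well-formedness question. It says nothing about whether a lenient, non-XML parser would still extract usable content from the failing pages.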

Re: [uf-discuss] stats on well formed XHTML

2008-01-16 Thread ryan
On Jan 16, 2008, at 12:41 AM, Kevin Burton wrote: Has anyone done any large scale audits of XHTML in the wild to determine the percentage that parse correctly? Yes, Ian Hickson at Google did a survey of about 1B pages and found that over 90% had *well-formedness* errors. I can't find a

Re: [uf-discuss] stats on well formed XHTML

2008-01-16 Thread Kevin Burton
Specifically, the probability that a naive non-XML parser can make while indexing the content. I'm not sure what you mean here, but I'd recommend against using an XML parser against web content and instead use something like the HTML5 parsing algorithm [#html5-parsing]. Yes... I'm just

Re: [uf-discuss] stats on well formed XHTML

2008-01-16 Thread Derrick Lyndon Pallas
Kevin Burton wrote: I'm not sure what you mean here, but I'd recommend against using an XML parser against web content and instead use something like the HTML5 parsing algorithm [#html5-parsing]. Yes... I'm just trying to avoid using a full HTML parser (DOM or not) to avoid garbage