What I found with the Technorati crawler was that the Atom timestamps
were more reliable than RSS, as RSS timezones were underspecified.
Talking of hAtom, here's a tool that uses it:
http://googlenotebookblog.blogspot.com/2008/01/permalinks-up-and-hatom.html
Niall told me last night that he's
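To illustrate the timezone point above, here is a minimal Python sketch. It assumes Atom timestamps follow RFC 3339 and RSS dates follow the RFC 822 style; the function names are hypothetical and this is not the Technorati crawler's code. It only shows how an RSS date can arrive without a usable offset, while an Atom timestamp always carries one.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime


def parse_atom_timestamp(value: str) -> datetime:
    """Parse an RFC 3339 timestamp as used by Atom (assumed format)."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat,
    # so normalize it to an explicit UTC offset first.
    if value.endswith(("Z", "z")):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def parse_rss_timestamp(value: str) -> datetime:
    """Parse an RFC 822 style date as used by RSS pubDate (assumed format)."""
    # Returns a naive datetime when the feed gives no usable timezone.
    return parsedate_to_datetime(value)


def has_explicit_offset(dt: datetime) -> bool:
    """True when the parsed value pins down an absolute instant."""
    return dt.tzinfo is not None and dt.utcoffset() is not None


if __name__ == "__main__":
    for raw in ("Fri, 18 Jan 2008 00:06:00 -0800", "Fri, 18 Jan 2008 00:06:00"):
        parsed = parse_rss_timestamp(raw)
        print(raw, "->", parsed.isoformat(), has_explicit_offset(parsed))
```

A crawler built along these lines would have to guess an offset for every RSS item where `has_explicit_offset` is false, which is one way the reliability gap can show up.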
Rob Crowther wrote:
On 17/01/2008, Derrick Lyndon Pallas [EMAIL PROTECTED] wrote:
Not so. The Internet Archive knows the first time they've seen an URL,
over the past ten years; they can also tell you when the content has
significantly changed.
But can it tell you whether it's a new
On Jan 18, 2008 12:06 AM, Kevin Marks [EMAIL PROTECTED] wrote:
What I found with the Technorati crawler was that the Atom timestamps
were more reliable than RSS, as RSS timezones were underspecified.
Talking of hAtom, here's a tool that uses it:
On Wed, January 16, 2008 11:04 pm, ryan wrote:
On Jan 16, 2008, at 12:41 AM, Kevin Burton wrote:
Has anyone done any large scale audits of XHTML in the wild to
determine the percentage that parse correctly?
Yes, Ian Hickson at Google did a survey of about 1B pages and found
that over 90%
On 17 Jan 2008, at 19:22, Nick Fitzsimons wrote:
I can't imagine that things have got any better since :-(
to really evaluate this, there are two parameters to take into account.
nb of xhtml pages / nb of total pages [now]
but in my humble opinion, more interesting
Karl Dubost wrote:
to really evaluate this, there are two parameters to take into account.
nb of xhtml pages / nb of total pages [now]
I can probably tell you both of those numbers for the last couple of
months. Knowing how many pages are malformed might take a bit longer.
On 17 Jan 2008, at 01:44, Kevin Burton wrote:
Specifically, the probability that a naive non-XML parser can make
while indexing the content.
I'm not sure what you mean here, but I'd recommend against using an
XML parser against web content and instead use something like the
HTML5 parsing
On 17/01/2008, Derrick Lyndon Pallas [EMAIL PROTECTED] wrote:
Not so. The Internet Archive knows the first time they've seen an URL,
over the past ten years; they can also tell you when the content has
significantly changed.
But can it tell you whether it's a new page or an old page at a new
but in my humble opinion, more interesting would be to have this ratio
for each year with *only the new pages* created during the year.
Unfortunately because there is no uniform way to sign the date of
pages, and because HTTP is in even worse shape than HTML, it is almost
impossible to
Not so. The Internet Archive knows the first time they've seen an URL,
over the past ten years; they can also tell you when the content has
significantly changed. Obviously, there is a bias towards pages (and
sites) with higher traffic, but that seems reasonable if you're
evaluating standard
On 18 Jan 2008, at 09:03, Kevin Burton wrote:
One could perform such an audit with hAtom published values. Either
that or use the RSS timestamp or timestamp in the URL.
Hmm, maybe an intermediate possibility: the timestamp of domain creation.
whois microformats.org
Created On:26-Jan-2005
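The domain-creation idea above could be sketched as follows in Python. This assumes the whois output contains a line shaped exactly like the `Created On:26-Jan-2005` example; other registries format this field differently, and both the function name and the regular expression are hypothetical.

```python
import re
from datetime import date
from typing import Optional

# English month abbreviations, mapped by hand so the result does not
# depend on the process locale.
_MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

# Assumption: the field looks like "Created On:26-Jan-2005".
_CREATED_RE = re.compile(r"Created On:\s*(\d{1,2})-([A-Za-z]{3})-(\d{4})")


def parse_created_on(whois_text: str) -> Optional[date]:
    """Return the domain creation date found in whois output, or None."""
    match = _CREATED_RE.search(whois_text)
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = _MONTHS.get(month_name.lower())
    if month is None:
        return None
    return date(int(year), month, int(day))


if __name__ == "__main__":
    print(parse_created_on("Created On:26-Jan-2005"))
```

The creation date only gives a lower bound on the age of any page under that domain, so it is a coarse proxy at best.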
Has anyone done any large scale audits of XHTML in the wild to
determine the percentage that parse correctly?
I'm thinking about deploying one in Spinn3r but I'd rather focus on
other tasks if this has already been done.
I'm curious about the assumptions one could make when assuming that
XHTML
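A small-scale version of the audit asked about above might look like this Python sketch. It assumes the pages have already been fetched into strings, and it uses the standard library's `xml.etree.ElementTree` as the XML parser; the function names are hypothetical and this is not how the survey mentioned in the replies was carried out.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def is_well_formed(markup: str) -> bool:
    """True when the markup parses as XML without a well-formedness error."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


def well_formed_ratio(pages: Iterable[str]) -> float:
    """Fraction of the given pages that parse as well-formed XML."""
    pages = list(pages)
    if not pages:
        return 0.0
    return sum(is_well_formed(page) for page in pages) / len(pages)


if __name__ == "__main__":
    sample = [
        '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>ok</p></body></html>',
        "<html><body><p>unclosed<br></body></html>",
    ]
    print(well_formed_ratio(sample))
```

One caveat with this approach: the parser does not load external DTDs, so a page that relies on named entities such as `&nbsp;` is reported as a parse error here, which would overstate the failure rate.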
On Jan 16, 2008, at 12:41 AM, Kevin Burton wrote:
Has anyone done any large scale audits of XHTML in the wild to
determine the percentage that parse correctly?
Yes, Ian Hickson at Google did a survey of about 1B pages and found
that over 90% had *well-formedness* errors. I can't find a
Specifically, the probability that a naive non-XML parser can make
while indexing the content.
I'm not sure what you mean here, but I'd recommend against using an
XML parser against web content and instead use something like the
HTML5 parsing algorithm [#html5-parsing].
Yes... I'm just
Kevin Burton wrote:
I'm not sure what you mean here, but I'd recommend against using an
XML parser against web content and instead use something like the
HTML5 parsing algorithm [#html5-parsing].
Yes... I'm just trying to avoid using a full HTML parser (DOM or not)
to avoid garbage
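The trade-off discussed above, a strict XML parser versus a tolerant HTML parser, can be sketched in Python. Note the assumption: the standard library's `html.parser.HTMLParser` is a lenient, event-based tokenizer that builds no DOM, but it is not an implementation of the HTML5 parsing algorithm. The class and function names below are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from typing import List


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags without building a tree."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(markup: str) -> List[str]:
    """Extract links from possibly malformed markup."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


def strict_xml_accepts(markup: str) -> bool:
    """True when a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    messy = '<p>unclosed <a href="http://example.org/one">one<br><a href=http://example.org/two>two'
    print(strict_xml_accepts(messy))
    print(extract_links(messy))
```

Because the collector keeps only a list of strings, it allocates far less than a full DOM would, though it also gives up the error recovery rules that the HTML5 algorithm specifies.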