On Mon, May 02, 2005 at 01:51:16PM -0700, Dan Quinlan wrote:
> message. For instance, the same URL shows up multiple times with the
> current API. I'd like to be able to do things like:

How so? Also, keep in mind there's a difference between the text-parsed
and HTML-parsed URIs. If we wanted to merge those together, it'd be
pretty easy.

> - how many URLs were there originally and what were they?

keys %array in scalar and array context.

> - what are the list of sites users could likely go to?
>   (so, canonicalized anchor destinations (ignore the stuff
>   between <a> and </a> in those cases) plus non-hyperlink "cut and
>   paste or expect MUA to hyperlinkize" ones in text where there wasn't
>   a real anchor)

I guess it depends what you mean by "sites". If hostname or domain,
that's pretty trivial, see the URIBL code.

> - which URLs don't match their text?

Easily checked. See EvalTests::check_https_ip_mismatch().

> - which URLs were over-encoded and where do *they* ultimately go?

You could compare the key and the list of canonicals, but it depends
what your algorithm would be.

-- 
Randomly Generated Tagline:
"Hurry up! I wanna see the moon." -Fry
"Relax. It's open 'till nine." -Leela
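
For illustration only, the "compare the key and the list of canonicals" idea can be sketched roughly as below. This is Python rather than the project's Perl, and it is not the SpamAssassin API: the function names (canonicalize, build_uri_map, over_encoded, hostnames) and the "raw URI maps to a list of canonical forms" layout are assumptions made for the example, and percent-decoding stands in for whatever canonicalization algorithm is actually wanted.

```python
from urllib.parse import unquote, urlsplit


def canonicalize(raw):
    """Return candidate forms for a raw URI: the URI itself, plus its
    percent-decoded form when that differs (a crude stand-in for real
    canonicalization)."""
    forms = [raw]
    decoded = unquote(raw)
    if decoded != raw:
        forms.append(decoded)
    return forms


def build_uri_map(raw_uris):
    """Map each distinct raw URI to its list of candidate forms.
    Duplicates collapse because the raw URI is the dictionary key."""
    uri_map = {}
    for raw in raw_uris:
        uri_map.setdefault(raw, canonicalize(raw))
    return uri_map


def over_encoded(uri_map):
    """Raw URIs whose decoded form differs from the key, mapped to that
    decoded form. Any percent-escape counts here, including legitimate
    ones, so treat the result as candidates only."""
    return {raw: forms[-1] for raw, forms in uri_map.items() if len(forms) > 1}


def hostnames(uri_map):
    """Set of hostnames taken from the last (most decoded) form of each URI."""
    hosts = set()
    for forms in uri_map.values():
        host = urlsplit(forms[-1]).hostname
        if host:
            hosts.add(host)
    return hosts
```

With a map like this, the number of distinct URLs is the number of keys, the likely "sites" are the hostname set, and the over-encoded ones are the keys whose decoded form differs from the key itself.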
