Harryos,

On 28/09/2010 at 09:56, harryos wrote:
> thanks Erik,
> By 'update' I meant a major addition/removal of text (say 100
> characters).
> Initially I thought of making a hash of a page and comparing it to the
> saved hash of the same page at a different moment in time. But this
> would cause even a tiny change to be considered an update. I would like
> to use a filter to set an update of x number of characters.
> Maybe using f=urllib.urlopen and
> currentsize=len(f.read()) will let me find the number of added/
> removed characters, and set the filter accordingly.

Content length (which you could also get from the HTTP "Content-Length" header) won't necessarily tell you whether the content has changed: an edit that replaces text with other text of the same length leaves the size unchanged.

I think your problem is a candidate for the Levenshtein distance (calculating the "distance" between two strings), for which I think there are Python implementations:

http://en.wikipedia.org/wiki/Levenshtein_distance

Depending on your requirements, you could add other heuristics to detect major changes, e.g. load the page into an XML parser and only check certain <div>s. But further suggestions would require more information on your problem.

Kind regards,
Erik
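A minimal sketch of the edit-distance idea mentioned above, written for Python 3 and operating on two page texts that have already been fetched. The names `levenshtein` and `is_major_update` and the default threshold of 100 characters are assumptions made for illustration (the threshold is taken from the "say 100 characters" in the question); this is not a reference to any particular library.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b: the minimum number
    of single-character insertions, deletions and substitutions needed to
    turn one into the other."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute (or match)
        previous = current
    return previous[-1]


def is_major_update(old_text, new_text, threshold=100):
    """Hypothetical filter: treat the page as updated only when at least
    `threshold` character edits separate the two versions."""
    return levenshtein(old_text, new_text) >= threshold


if __name__ == "__main__":
    old = "The quick brown fox"
    new = "The quick green fox"
    # The two strings have the same length, so a size check sees no change,
    # while the edit distance is non-zero.
    print(len(old) == len(new), levenshtein(old, new))
```

This pure-Python version takes time proportional to len(a) * len(b), so it may be too slow for full-size pages. Comparing only selected parts of the page, or comparing line by line with the standard library's difflib module, may be more practical there.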