Harryos,

On 28/09/2010 at 09:56, harryos wrote:

> thanks Erik,
> By 'update' I meant a major addition/removal of text (say 100
> characters).
> Initially I thought of making a hash of a page and comparing it to the
> saved hash of the same page at a different moment in time. But this
> would cause even a tiny change to be considered an update. I would like
> to use a filter to set an update threshold of x characters.
> Maybe using f=urllib.urlopen and
> currentsize=len(f.read()) will let me find the number of added/
> removed characters, and set the filter accordingly.


Content length (which you could also get from the HTTP "Content-Length" 
header) won't necessarily tell you whether the content has changed: an edit 
that replaces text with other text of the same length leaves the size 
untouched. I think your problem is a candidate for 
http://en.wikipedia.org/wiki/Levenshtein_distance (calculating the "distance" 
between two strings), for which I think there are Python implementations.
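
To make the idea concrete, here is a minimal pure-Python sketch of the edit 
distance, assuming Python 3. The names levenshtein and is_major_update, and 
the default threshold of 100 characters, are only illustrative choices taken 
from your "say 100 characters" example, not part of any library.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b.

    The distance is the minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b.
    """
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


def is_major_update(old_text, new_text, threshold=100):
    """Hypothetical filter: True if at least `threshold` edits separate the texts."""
    return levenshtein(old_text, new_text) >= threshold
```

Note that this runs in time proportional to len(a) * len(b), so for large 
pages it may be worth comparing only the part of the page you care about.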

Depending on your requirements, you could add other heuristics to detect major 
changes, e.g. load the page into an XML or HTML parser and only check certain 
<div> elements. But further suggestions would require more information on your 
problem.
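
As a rough illustration of the "only check certain <div> elements" heuristic, 
the sketch below pulls the text out of a single <div> using the standard 
library's html.parser module (Python 3 assumed). The class DivTextExtractor, 
the helper div_text and the idea of selecting the <div> by its id attribute 
are assumptions made for the example; your pages may need a different 
selection rule.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside the first <div> whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # <div> nesting depth inside the target; 0 = outside
        self.done = False   # set once the target <div> has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth > 0:
            self.depth += 1  # a nested <div> inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1   # entering the target <div>

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def div_text(html, target_id):
    """Return the whitespace-normalised text of the chosen <div>, or ""."""
    parser = DivTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

You could then compare the extracted text from two snapshots of the page 
instead of the whole document, so that changes in ads or navigation do not 
count as updates.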

Kind regards,

Erik
