There are two kinds of changes here. The first kind is when a single document changes. That changes the distances between that document and the others, but it doesn't change the distance between any two other documents. Most importantly, it doesn't change the distance between a query and any other document.
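As a rough illustration of this first kind of change, here is a minimal sketch. It assumes raw term-count vectors with cosine similarity and no corpus-wide weighting. The documents, the query, and the helper names (`vectorize`, `cosine`, `similarities`) are all made up for the example.

```python
import math
from collections import Counter


def vectorize(text):
    """Raw term-count vector; no corpus-wide weighting is applied."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical three-document corpus and query.
docs = {
    "a": "apache mahout clustering",
    "b": "vector space clustering",
    "c": "web crawler index space",
}
query = vectorize("clustering vector")


def similarities(docs):
    """Compute a few pairwise similarities from the current document state."""
    vecs = {name: vectorize(text) for name, text in docs.items()}
    return {
        "a-b": cosine(vecs["a"], vecs["b"]),
        "b-c": cosine(vecs["b"], vecs["c"]),
        "query-b": cosine(query, vecs["b"]),
    }


before = similarities(docs)
docs["a"] = "apache mahout vector space model"  # document "a" is re-crawled
after = similarities(docs)

for pair in before:
    print(pair, round(before[pair], 3), "->", round(after[pair], 3))
```

Only the pair involving document "a" moves. The similarity between "b" and "c" and the similarity between the query and "b" come out the same before and after, because none of those vectors were touched.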
The second kind of change follows from the first and is largely unavoidable. When a document changes, the corpus word frequencies almost inevitably change as a result. This changes the weightings applied to particular terms in all documents. When you have many documents and only a few of them change, these shifts will be small enough to ignore.

In practice, you don't much care about what has changed, because a live system computes all similarities or distances on the fly based on the current state. If similarities that you have not yet computed change, you don't care.

On Tue, Jun 23, 2009 at 5:01 PM, Paul Jones <[email protected]> wrote:

> Yes another question, am going through a rapid learning curve...
>
> All these vector based systems, which require you to build a term-doc etc,
> are they of any use in a system where the data is changing, i.e lets assume
> the docs are webpages, which are being crawled, and hence updated. Surely if
> there is a vector diagram being formed, then the position of these vectors
> changes based on the changes (size, content) of the entire matrix, or am I
> missing something here.
>
> If the above is correct, then is a actual live project how is this done,
> are distances worked out on a per-day type of basis, and the indexes then
> updated ?
>
> Paul

-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
