Thanks Ted, but if it's a live system with 10 million documents, isn't computing on the fly going to be a pain, especially if you add, say, 1000 docs per hour? That's why I was assuming it's a batch process.

Also, I think I have worked out what I meant about the relationships between the words themselves. I think I was looking to build a term-term matrix instead of a term-doc matrix, in which I have the frequency of occurrence of each word alongside each other word in a doc. (I guess an easy way to start is to let the two words co-occur anywhere in the doc.) Once that is done, hopefully the 'distance' between two term vectors should give me a relative relationship. I realise there are lots of problems with this approach, i.e. I don't know how the words are related... I just know that they are.

Paul

________________________________
From: Ted Dunning <[email protected]>
To: [email protected]
Sent: Wednesday, 24 June, 2009 1:52:41
Subject: Re: LSI, cosine and others which use vectors

There are two kinds of changes here. The first kind is when a single document changes. That will change the distances between that document and others, but it won't change the distances between two other documents. Most importantly, it won't change the distance between queries and other documents.

The second kind of change is due to the first and is relatively unavoidable. When a document changes, almost inevitably the corpus word frequencies will change as a result. This changes the weightings applied to particular terms in documents. When you have many documents of which few change, these changes will be small enough to ignore.

In practice, you don't much care about what has changed, because a live system computes all similarities or distances on the fly based on the current state. If the similarities that you have not yet computed change, you don't care.

On Tue, Jun 23, 2009 at 5:01 PM, Paul Jones <[email protected]> wrote:

> Yet another question, am going through a rapid learning curve...
>
> All these vector based systems, which require you to build a term-doc etc,
> are they of any use in a system where the data is changing, i.e. let's assume
> the docs are webpages, which are being crawled, and hence updated. Surely if
> there is a vector diagram being formed, then the position of these vectors
> changes based on the changes (size, content) of the entire matrix, or am I
> missing something here.
>
> If the above is correct, then in an actual live project how is this done,
> are distances worked out on a per-day type of basis, and the indexes then
> updated?
>
> Paul

--
Ted Dunning, CTO
DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
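
For reference, here is a minimal sketch of the term-term idea raised at the top of this message. It is only one possible reading, under assumptions that are not in the thread: two words "co-occur" if both appear anywhere in the same doc, each doc counts once per pair, tokenising is a plain whitespace split, and the 'distance' is taken to be cosine similarity between term rows. The function names `cooccurrence` and `cosine` and the toy docs are made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations
import math


def cooccurrence(docs):
    """Build a sparse term-term matrix.

    matrix[a][b] is the number of docs in which terms a and b both appear
    (assumption: co-occurrence anywhere in the doc, counted once per doc).
    """
    matrix = defaultdict(Counter)
    for doc in docs:
        # Assumption: naive lowercase + whitespace tokenising.
        terms = sorted(set(doc.lower().split()))
        for a, b in combinations(terms, 2):
            matrix[a][b] += 1
            matrix[b][a] += 1
    return matrix


def cosine(u, v):
    """Cosine similarity between two sparse term vectors (Counters)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_u * norm_v)


# Toy corpus, purely illustrative.
docs = [
    "cat sat mat",
    "dog sat mat",
    "cat chased dog",
]
m = cooccurrence(docs)
print(cosine(m["cat"], m["dog"]))
```

As noted in the message, a high score only says that two words keep similar company, not how they are related. Because each doc contributes its pair counts independently, a changed doc could in principle be handled by subtracting its old pairs and adding its new ones rather than rebuilding the whole matrix, though that is not shown here.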
