The distinguishing characteristics you mark out and put in a field may no longer be so distinguishing once more content is added to the index (e.g. as new terminology like "podcast" becomes more prevalent). Maintaining or regenerating this field in anything other than a static index then starts to look like a non-trivial overhead.
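
To make the drift concrete, here is a minimal plain-Python sketch (not Lucene code). It assumes an IDF of the form log(N / (df + 1)) + 1, the same shape as Lucene's classic similarity; the toy documents and the helper names `idf` and `most_distinguishing` are made up for illustration.

```python
import math

def idf(term, docs):
    """Smoothed inverse document frequency: log(N / (df + 1)) + 1."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (df + 1)) + 1.0

def most_distinguishing(terms, docs):
    """Pick the term with the highest IDF (ties broken alphabetically)."""
    return max(sorted(terms), key=lambda t: idf(t, docs))

# Hypothetical index contents at two points in time (each doc is a set of terms).
early = [{"radio", "show"}, {"podcast", "audio"}, {"news", "radio"}, {"music", "show"}]
later = early + [{"podcast", "feed"}, {"podcast", "episode"},
                 {"podcast", "radio"}, {"podcast", "show"}]

# The terms of one document whose "distinguishing" term we stored at index time.
target = {"podcast", "radio"}

print(most_distinguishing(target, early))  # choice made against the early index
print(most_distinguishing(target, later))  # choice after "podcast" became common
```

Against the early index "podcast" is the rarer term, but once podcast-related documents arrive "radio" becomes the better discriminator, so a value stored at index time would have to be recomputed.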
While we are musing on this, I'm not sure that with things like MoreLikeThis (or BooleanQuery scoring?) we have considered the true value of *coincidences* of terms, as opposed to independently summing their individual IDFs. For example, take the terms "female", "John" and "London": all three may have equal IDF, but should a document representing a female in London be given the same weight as a document representing the rarer case of a female who happens to be called "John"? Considering these pairings adds complexity and cost, but might be an interesting avenue to explore for some apps when selecting distinguishing characteristics or weighting query results.

Cheers
Mark

----- Original Message ----
From: karl wettin <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, 9 February, 2007 8:31:05 AM
Subject: Reduction based "more like this"?

I just woke up thinking it would be cool to attempt reducing the data of all documents using PCA (or so) and store the output in a new field per dimension introduced, in order to find similar documents by placing a simple proximity query.

Did anyone attempt something like this? I did not think this through that much. Nor do I need this feature. Just think it would be a cool experiment.

-- 
karl

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
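
As a rough illustration of the "coincidences of terms" point in Mark's reply, the sketch below compares summing per-term IDFs with an IDF computed from the joint document frequency of a term pair. It is plain Python rather than Lucene code; the toy corpus and the helpers `summed_idf` and `joint_idf` are assumptions made up for this example, and the log(N / (df + 1)) + 1 form is just one common IDF variant.

```python
import math

def idf_from_df(df, n_docs):
    """Smoothed IDF: log(N / (df + 1)) + 1."""
    return math.log(n_docs / (df + 1)) + 1.0

def summed_idf(terms, docs):
    """Score a set of terms by summing each term's independent IDF."""
    return sum(
        idf_from_df(sum(1 for d in docs if t in d), len(docs)) for t in terms
    )

def joint_idf(terms, docs):
    """Score a set of terms by how rarely they all occur in the same document."""
    joint_df = sum(1 for d in docs if all(t in d for t in terms))
    return idf_from_df(joint_df, len(docs))

# Hypothetical corpus: "female", "john" and "london" each occur in 4 of 8 docs,
# but female+london co-occur in 3 docs while female+john co-occur in only 1.
people = [
    {"female", "london"}, {"female", "london"}, {"female", "london"},
    {"female", "john"},
    {"john", "male"}, {"john", "male"}, {"john", "male"},
    {"london", "male"},
]

# Independent summing cannot tell the two pairings apart...
print(summed_idf(["female", "london"], people), summed_idf(["female", "john"], people))
# ...whereas the joint frequency rates female+john as the rarer coincidence.
print(joint_idf(["female", "london"], people), joint_idf(["female", "john"], people))
```

The extra cost is visible even here: the joint score needs a document frequency per term pair rather than per term, which grows quickly with vocabulary size.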