The distinguishing characteristics you pick out and store in a field may become 
less distinguishing as more content is added to an index (e.g. as new 
terminology like "podcast" becomes more prevalent). Maintaining or regenerating 
this field in anything other than a static index then starts to look like a 
non-trivial overhead.

While we are musing on this, I'm not sure that with things like MoreLikeThis 
(or the BooleanQuery scoring?) we have considered the true value of 
*coincidences* of terms, as opposed to independently summing their individual 
IDFs. For example, take the terms "female", "John" and "London": all 3 may have 
equal IDF, but should a document representing a female in London be given the 
same weighting as a document representing the rarer case of a female who 
happens to be called "John"? Considering these pairings adds extra 
complexity/cost but might be an interesting avenue to explore for some apps 
when selecting distinguishing characteristics or weighting query results.
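
To make the "female"/"John"/"London" point concrete, here is a minimal sketch 
in plain Python (not Lucene code). The toy corpus, the helper names `idf` and 
`summed_idf`, and the simple log(N / df) formula are all assumptions made for 
illustration; Lucene's actual scoring formula differs.

```python
import math

# Hypothetical toy corpus: each document is the set of terms it contains.
# Each of "female", "John" and "London" appears in exactly 4 of the 8 docs.
docs = [
    {"female", "London"},
    {"female", "London"},
    {"female", "London"},
    {"female", "John"},
    {"John", "London"},
    {"John"},
    {"John"},
    {"male", "Paris"},
]

def idf(terms, docs):
    """IDF of a set of terms occurring *together*: log(N / df)."""
    df = sum(1 for d in docs if terms <= d)  # docs containing every term
    # Assumption: a combination never seen in the index gets no weight.
    return math.log(len(docs) / df) if df else 0.0

def summed_idf(terms, docs):
    """The usual approach: add up each term's IDF independently."""
    return sum(idf({t}, docs) for t in terms)

for pair in ({"female", "London"}, {"female", "John"}):
    print(sorted(pair),
          "summed:", round(summed_idf(pair, docs), 3),
          "pair:", round(idf(pair, docs), 3))
```

With this corpus the summed IDFs of the two pairs are identical, while the 
pair-level IDF is higher for "female" + "John" because that combination occurs 
in fewer documents. The extra cost is the document frequency lookup per pair.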

Cheers
Mark




----- Original Message ----
From: karl wettin <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, 9 February, 2007 8:31:05 AM
Subject: Reduction based "more like this"?

I just woke up thinking it would be cool to attempt reducing the data  
of all documents using PCA (or similar) and store the output in a new  
field per dimension introduced, in order to find similar documents by  
placing a simple proximity query. Did anyone attempt something like  
this?

I did not think this through that much. Nor do I need this feature.  
Just think it would be a cool experiment.
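
A rough sketch of what such an experiment might look like, in plain Python 
with no Lucene involved. Everything here is an assumption for illustration: 
the term-count vectors are made up, the helper names (`pca_components`, 
`project`, `similar`) are hypothetical, PCA is done by simple power iteration, 
and the "proximity query" is modelled as a range check on each reduced 
dimension, much as one range clause per stored field would behave.

```python
import math

def pca_components(rows, k, iterations=100):
    """Return (means, components): up to k principal axes via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(k):
        v = [1.0 + i for i in range(d)]  # arbitrary non-uniform start vector
        eigenvalue = 0.0
        for _step in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left along any direction
                break
            v = [x / norm for x in w]
            eigenvalue = norm
        if eigenvalue == 0.0:
            break
        components.append(v)
        # Deflate: remove the component just found before looking for the next.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return means, components

def project(row, means, components):
    """Coordinates of one document in the reduced space (one value per 'field')."""
    return [sum((x - m) * c for x, m, c in zip(row, means, comp))
            for comp in components]

def similar(name, reduced, radius):
    """Documents whose every reduced coordinate lies within `radius` of `name`'s."""
    target = reduced[name]
    return [other for other, coords in reduced.items()
            if other != name
            and all(abs(a - b) <= radius for a, b in zip(coords, target))]

# Hypothetical term-count vectors over three terms.
vectors = {"A": [5, 0, 1], "B": [4, 1, 1], "C": [0, 5, 1], "D": [1, 4, 1]}
means, components = pca_components(list(vectors.values()), k=1)
reduced = {name: project(v, means, components) for name, v in vectors.items()}
print(similar("A", reduced, radius=2.0))
```

Here three term counts collapse to a single stored value per document, and 
the range check around "A" picks out the document with the closest term 
profile. As new documents arrive the components would need recomputing.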

-- 
karl

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]






                