Try looking at the random indexing literature. Sparse binary context vectors should give you pretty much what you need for the context similarity.
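
To make that a bit more concrete, here is a rough Python sketch of the random indexing idea, not code from any of the packages linked in this message. The dimensionality, the number of non-zero components, the window size, and all function names are assumptions picked for illustration. Each word gets a fixed sparse binary index vector, and a word's context vector is the sum of the index vectors of its neighbours.

```python
import random
from collections import defaultdict
from math import sqrt

DIM = 512      # length of every index vector (assumed value)
NONZERO = 8    # non-zero components per index vector (assumed value)
WINDOW = 2     # neighbours counted on each side of a word (assumed value)


def index_vector(word):
    """Sparse binary index vector, stored as the set of positions holding a 1."""
    rng = random.Random(word)  # seeding with the word keeps the vector stable
    return frozenset(rng.sample(range(DIM), NONZERO))


def context_vectors(sentences):
    """Sum the index vectors of each word's neighbours within WINDOW."""
    vectors = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - WINDOW)
            hi = min(len(tokens), i + WINDOW + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                for pos in index_vector(tokens[j]):
                    vectors[word][pos] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    corpus = [
        ["the", "cat", "drinks", "milk"],
        ["the", "dog", "drinks", "milk"],
    ]
    vecs = context_vectors(corpus)
    # "cat" and "dog" occur in identical contexts here, so this is close to 1.0
    print(cosine(vecs["cat"], vecs["dog"]))
```

The non-zero positions of a word's context vector are the kind of thing you could store as terms in a separate index field.
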
You can encode your existing synonyms and learn co-occurrence-based synonyms at the same time. To be able to query using any of these systems, you would have to increase the size of your index, but unless you have a huge system, that should be relatively easy.

The idea is that your Lucene index would contain separate fields for:

a) the original words
b) the synonym sets for the original words
c) the non-zero context vector components

For a query, you can form three components that correspond to these three fields, and you can include or exclude them at will to find out what works well.

http://www.sics.se/~mange/random_indexing.html
http://code.google.com/p/semanticvectors/
http://portal.acm.org/citation.cfm?id=146565.146569
http://www.d.umn.edu/~tpederse/Pubs/eacl2006-vector.pdf

On Fri, Jun 26, 2009 at 5:57 AM, Paul Jones <[email protected]> wrote:

> What I had in mind was to
> a) start with existing synonyms
>
> and then
>
> b) add to this system using various algos to determine word distance
>
> I have stayed away from Solr, because from what I have read everyone seems
> to be pointing to it as an enterprise app, whereas I need something bigger,
> not sure if this is correct
>
