> I was basically thinking of using lucene to generate document
> vectors, and writing my custom similarity algorithms for measuring
> distance.
>
> I could then run this data through k-means or SOM algorithms for
> calculating clusters

First of all, I think it would already be great if there were some
way to simply store document vectors during the indexing process, so
that you could later use something like

  IndexSearcher.docTerms(int i)

to retrieve either a BitSet (term present or not) or an array of floats
weighted so that frequent terms get lower values.
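
To make the weighting idea concrete, here is a rough sketch in plain
Java. It does not use Lucene at all: the class TermWeights and its
methods idf and weigh are names I made up for illustration, and
log(N / df) is just one common way to give frequent terms lower values,
not necessarily the right one for clustering.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class TermWeights {

    // Inverse document frequency: log(N / df), where N is the number of
    // documents and df the number of documents containing the term.
    // A term that occurs in every document gets weight 0.
    static Map<String, Double> idf(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<String, Integer>();
        for (List<String> doc : docs) {
            // Count each term at most once per document.
            for (String term : new HashSet<String>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<String, Double>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log((double) docs.size() / e.getValue()));
        }
        return idf;
    }

    // Sparse document vector: term frequency times idf.
    // Terms missing from the idf table contribute 0.
    static Map<String, Double> weigh(List<String> doc, Map<String, Double> idf) {
        Map<String, Double> vector = new HashMap<String, Double>();
        for (String term : doc) {
            vector.merge(term, idf.getOrDefault(term, 0.0), Double::sum);
        }
        return vector;
    }
}
```

The vectors are kept as sparse maps here because most documents contain
only a small fraction of all terms.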

One difficulty I see here is that terms don't seem to have any unique
identifiers, so I guess you'd have to manage those yourself...
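
Managing the identifiers yourself could look roughly like the sketch
below. Again this is plain Java, not Lucene: TermDictionary, idOf,
termOf and encode are hypothetical names, and the ids are simply handed
out in the order in which terms are first seen.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class TermDictionary {
    private final Map<String, Integer> ids = new HashMap<String, Integer>();
    private final List<String> terms = new ArrayList<String>();

    // Returns the id of a term, assigning the next free id on first use.
    int idOf(String term) {
        Integer id = ids.get(term);
        if (id == null) {
            id = terms.size();
            ids.put(term, id);
            terms.add(term);
        }
        return id;
    }

    // Reverse lookup from id to term.
    String termOf(int id) {
        return terms.get(id);
    }

    int size() {
        return terms.size();
    }

    // Encodes a document as a BitSet with one bit set per distinct term.
    BitSet encode(Collection<String> docTerms) {
        BitSet bits = new BitSet();
        for (String term : docTerms) {
            bits.set(idOf(term));
        }
        return bits;
    }
}
```

The ids are only stable as long as the same dictionary instance (or a
saved copy of it) is used for every document, so it would have to be
stored alongside the index.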

--
Eric Jain

