- Row vectors are named and labeled with the unique identifier field of the index, which is defined by the client
On Tue, Apr 9, 2013 at 8:43 PM, Gokhan Capan <[email protected]> wrote:

> I have an implementation of "casting" a Lucene index to a SparseRowMatrix,
> with the following properties:
>
> - Row vectors are named and labeled with unique identifier id
> - Column vectors are labeled with terms
> - Dimensionality is numDocs * vocabularySize
> - It works on StringField, too.
> - It has a static creator for multiple fields, which returns an array of
>   matrices.
> - It doesn't support numerical fields yet.
>
> The code is tested, and I use it for instantiating matrices from Lucene
> indexes. I can submit a patch if it is desired.
>
> This is in memory, and loads the entire index into the matrix. Lately I've
> decided to implement a persistent version of it, which is planned to load
> from the index whenever a get request is made, and to write to the actual
> index on a set request. I also plan to use the docID field, which was
> attached as the row label in the previous implementation, as the actual row
> index. The rest will be the same.
>
> On Fri, Mar 29, 2013 at 3:53 AM, Ted Dunning <[email protected]> wrote:
>
>> It should be possible to view a Lucene index as a matrix. This would
>> require that we standardize on a way to convert documents to rows. There
>> are many choices, the discussion of which should be deferred to the actual
>> work on the project, but there are a few obvious constraints:
>>
>> a) it should be possible to get the same result as dumping the term
>> vectors for each document, each to a line, and converting that result
>> using standard Mahout methods.
>>
>> b) numeric fields ought to work somehow.
>>
>> c) if there are multiple text fields, that ought to work sensibly as well.
>> Two options include dumping multiple matrices or converting the fields
>> into a single row of a single matrix.
>>
>> d) it should be possible to refer back from a row of the matrix to find
>> the correct document. This might be because we remember the Lucene doc
>> number or because a field is named as holding a unique id.
>>
>> e) named vectors and matrices should be used if plausible.
>>
>> On Thu, Mar 28, 2013 at 4:58 PM, Dan Filimon <[email protected]> wrote:
>>
>> > ...
>> > Ted, could you explain a bit more what you mean by "simplify the
>> > connection to Lucene for clustering and classification"? It's too vague
>> > for an idea proposal.
>> >
>
> --
> Gokhan

--
Gokhan
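
For readers following the thread, the properties listed above (rows labeled with a unique id, columns labeled with terms, dimensionality numDocs * vocabularySize) can be illustrated with a minimal sketch. This is plain Python over a toy in-memory "index", not the Lucene or Mahout API and not the patch discussed in the thread; the helper name `index_to_sparse_rows`, the whitespace tokenization, and the sample documents are all assumptions made for illustration.

```python
from collections import Counter


def index_to_sparse_rows(docs, id_field, text_field):
    """Cast a toy 'index' (a list of dict documents) to labeled sparse rows.

    Hypothetical helper for illustration only; not Lucene or Mahout code.
    """
    # Assumption: terms are obtained by lowercasing and splitting on whitespace.
    tokenized = [d[text_field].lower().split() for d in docs]

    # Column labels are the sorted unique terms across all documents.
    vocab = sorted({term for tokens in tokenized for term in tokens})
    col_index = {term: j for j, term in enumerate(vocab)}

    rows = []        # one sparse row per document: {column index: term frequency}
    row_labels = []  # unique identifier of each document, in row order
    for doc, tokens in zip(docs, tokenized):
        counts = Counter(tokens)
        rows.append({col_index[term]: freq for term, freq in counts.items()})
        row_labels.append(doc[id_field])
    return rows, row_labels, vocab


# Toy documents standing in for an index; the field names are made up.
docs = [
    {"id": "doc-a", "body": "lucene index as matrix"},
    {"id": "doc-b", "body": "matrix rows map to documents"},
]
rows, row_labels, col_labels = index_to_sparse_rows(docs, "id", "body")

# Dimensionality is numDocs * vocabularySize.
print(len(rows), "x", len(col_labels))
```

Keeping `row_labels` alongside the rows is what lets a caller go from a matrix row back to the originating document, which is constraint (d) in the quoted message. Calling the same helper once per text field would give one matrix per field, roughly in the spirit of the "array of matrices" creator mentioned above.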
