OK, honestly I didn't understand the cross-recommendation part. I guess that for a possible persistent Lucene matrix implementation, the desired feature is a fast iterator that computes the next element by querying the index. Am I correct?
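To make the question concrete, here is a rough sketch of what I mean by an iterator that queries on each step, assuming the iteration is over rows. It is plain Java only: `RowSource`, `LazyRowIterator`, `inMemory` and `vocabulary` are hypothetical names, and a list of tokenized documents stands in for the index, so none of this is Lucene or Mahout API.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.TreeMap;

class LazyRowSketch {

    /** Hypothetical stand-in for the index: one lookup per row. */
    interface RowSource {
        int numRows();
        Map<String, Integer> termCounts(int row);
    }

    /** Yields sparse rows (column -> value); each next() does one lookup. */
    static final class LazyRowIterator implements Iterator<Map<Integer, Double>> {
        private final RowSource source;
        private final Map<String, Integer> termToColumn;
        private int nextRow = 0;

        LazyRowIterator(RowSource source, Map<String, Integer> termToColumn) {
            this.source = source;
            this.termToColumn = termToColumn;
        }

        @Override
        public boolean hasNext() {
            return nextRow < source.numRows();
        }

        @Override
        public Map<Integer, Double> next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            // Nothing is preloaded: the row is fetched only when asked for.
            Map<Integer, Double> row = new TreeMap<>();
            for (Map.Entry<String, Integer> e : source.termCounts(nextRow).entrySet()) {
                Integer column = termToColumn.get(e.getKey());
                if (column != null) {
                    row.put(column, e.getValue().doubleValue());
                }
            }
            nextRow++;
            return row;
        }
    }

    /** Assigns column numbers to terms in order of first appearance. */
    static Map<String, Integer> vocabulary(List<List<String>> docs) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                vocab.putIfAbsent(term, vocab.size());
            }
        }
        return vocab;
    }

    /** In-memory source over tokenized documents (assumption: stands in for an index). */
    static RowSource inMemory(List<List<String>> docs) {
        return new RowSource() {
            @Override
            public int numRows() {
                return docs.size();
            }

            @Override
            public Map<String, Integer> termCounts(int row) {
                Map<String, Integer> counts = new LinkedHashMap<>();
                for (String term : docs.get(row)) {
                    counts.merge(term, 1, Integer::sum);
                }
                return counts;
            }
        };
    }
}
```

In a persistent version, `termCounts` would be the only place that touches the index, so the iterator itself would not need to change.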
Should I submit the diff for the in-memory version to MAHOUT-1178, or create a separate issue?

On Wed, Apr 10, 2013 at 5:00 PM, Ted Dunning <[email protected]> wrote:

> This is awesome. Not exactly what I asked for, but in some ways better
> than what I asked for (I love it when that happens). I think a sequential
> implementation like this is a fine place to start.
>
> The array of matrices should work very well for moderate to small cross
> recommendation. If we have a sharded index, then we can build an
> InputFormat on the top of this pretty easily.
>
> Can you put up a JIRA and a patch for this?
>
>
> On Tue, Apr 9, 2013 at 10:43 AM, Gokhan Capan <[email protected]> wrote:
>
> > I have an implementation of "casting" a Lucene index to a SparseRowMatrix,
> > with following properties:
> >
> > - Row vectors are named and labeled with unique identifier id
> > - Column vectors are labeled with terms
> > - Dimensionality is numDocs * vocabularySize
> > - It works on StringField, too.
> > - It has a static creator for multiple fields, returns an array of matrix.
> > - It doesn't support numerical fields, yet.
> >
> > The code is tested, and I use it for instantiating matrices from Lucene
> > indexes. I can submit a patch if it is desired.
> >
> > This is in memory, and loads the entire index to the matrix. Lately I've
> > decided to implement a persistent version of it, which is planned to load
> > from index whenever a get request is made, and writes to actual index with
> > a set request. And I plan to use the docID field, which was attached as the
> > row label in previous implementation as the actual row index. Rest will be
> > the same.
> >
> >
> > On Fri, Mar 29, 2013 at 3:53 AM, Ted Dunning <[email protected]> wrote:
> >
> > > It should be possible to view a Lucene index as a matrix. This would
> > > require that we standardize on a way to convert documents to rows. There
> > > are many choices, the discussion of which should be deferred to the actual
> > > work on the project, but there are a few obvious constraints:
> > >
> > > a) it should be possible to get the same result as dumping the term vectors
> > > for each document each to a line and converting that result using standard
> > > Mahout methods.
> > >
> > > b) numeric fields ought to work somehow.
> > >
> > > c) if there are multiple text fields that ought to work sensibly as well.
> > > Two options include dumping multiple matrices or to convert the fields
> > > into a single row of a single matrix.
> > >
> > > d) it should be possible to refer back from a row of the matrix to find the
> > > correct document. This might be because we remember the Lucene doc number
> > > or because a field is named as holding a unique id.
> > >
> > > e) named vectors and matrices should be used if plausible.
> > >
> > > On Thu, Mar 28, 2013 at 4:58 PM, Dan Filimon <[email protected]> wrote:
> > >
> > > > ...
> > > > Ted, could you explain a bit more what you mean by "simplify the connection
> > > > to Lucene for clustering and classification"? It's too vague for an idea
> > > > proposal.
> > > >
> > >
> >
> > --
> > Gokhan
>

--
Gokhan
