I added Andy's first suggestion and Ted's suggestion as ideas. Andy, could you flesh out your second suggestion into a project and make an issue please?

On Fri, Mar 29, 2013 at 3:53 AM, Ted Dunning <[email protected]> wrote:

> It should be possible to view a Lucene index as a matrix. This would
> require that we standardize on a way to convert documents to rows. There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
>
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
>
> b) numeric fields ought to work somehow.
>
> c) if there are multiple text fields that ought to work sensibly as well.
> Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
>
> d) it should be possible to refer back from a row of the matrix to find the
> correct document. This might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
>
> e) named vectors and matrices should be used if plausible.
>
> On Thu, Mar 28, 2013 at 4:58 PM, Dan Filimon <[email protected]> wrote:
>
> > ...
> > Ted, could you explain a bit more what you mean by "simplify the connection
> > to Lucene for clustering and classification"? It's too vague for an idea
> > proposal.
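
To make constraints (c) and (d) above a bit more concrete, here is a rough sketch of the "single row of a single matrix" option. It is only an illustration: it uses plain Python dicts in place of Lucene term vectors and Mahout vectors, calls no Lucene or Mahout API, and every name in it (index_to_matrix, id_field, text_fields, the sample documents) is hypothetical.

```python
from collections import Counter


def index_to_matrix(docs, id_field="id", text_fields=("title", "body")):
    """Turn a list of field dicts into sparse rows plus lookup tables.

    Hypothetical helper: stands in for reading term vectors from an index.
    """
    dictionary = {}   # "field:term" -> column index
    rows = []         # one {column: count} dict per document
    row_to_id = []    # row number -> unique document id (constraint d)
    for doc in docs:
        row = Counter()
        for field in text_fields:
            for term in doc.get(field, "").lower().split():
                # Prefix each term with its field name so that several text
                # fields can share one row without colliding (constraint c).
                key = f"{field}:{term}"
                col = dictionary.setdefault(key, len(dictionary))
                row[col] += 1
        rows.append(dict(row))
        row_to_id.append(doc[id_field])
    return rows, dictionary, row_to_id


# Made-up sample documents, for illustration only.
docs = [
    {"id": "a1", "title": "lucene index", "body": "index as matrix"},
    {"id": "b2", "title": "mahout", "body": "matrix rows"},
]
rows, dictionary, row_to_id = index_to_matrix(docs)
```

Here row_to_id[i] gives the document behind row i, and the dictionary keys double as column labels, which is roughly what named vectors would provide for (e). Numeric fields (b) are left out of the sketch entirely.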
