Ok, updated!
On Fri, Mar 29, 2013 at 7:36 PM, Andy Twigg <[email protected]> wrote:

> Dan,
>
> I think what you've written is fine (I wanted to edit to remove the
> '?' around random forests but couldn't).
>
> ok?
>
>
> On 29 March 2013 11:14, Dan Filimon <[email protected]> wrote:
> > I added Andy's first suggestion and Ted's suggestion as ideas.
> >
> > Andy, could you flesh out your second suggestion into a project and
> > make an issue please?
> >
> >
> > On Fri, Mar 29, 2013 at 3:53 AM, Ted Dunning <[email protected]> wrote:
> >
> >> It should be possible to view a Lucene index as a matrix. This would
> >> require that we standardize on a way to convert documents to rows.
> >> There are many choices, the discussion of which should be deferred to
> >> the actual work on the project, but there are a few obvious constraints:
> >>
> >> a) it should be possible to get the same result as dumping the term
> >> vectors for each document each to a line and converting that result
> >> using standard Mahout methods.
> >>
> >> b) numeric fields ought to work somehow.
> >>
> >> c) if there are multiple text fields that ought to work sensibly as
> >> well. Two options include dumping multiple matrices or to convert the
> >> fields into a single row of a single matrix.
> >>
> >> d) it should be possible to refer back from a row of the matrix to find
> >> the correct document. This might be because we remember the Lucene doc
> >> number or because a field is named as holding a unique id.
> >>
> >> e) named vectors and matrices should be used if plausible.
> >>
> >> On Thu, Mar 28, 2013 at 4:58 PM, Dan Filimon <[email protected]> wrote:
> >>
> >> > ...
> >> > Ted, could you explain a bit more what you mean by "simplify the
> >> > connection to Lucene for clustering and classification"? It's too
> >> > vague for an idea proposal.
> >> >
> >
>
> --
> Dr Andy Twigg
> Junior Research Fellow, St Johns College, Oxford
> Room 351, Department of Computer Science
> http://www.cs.ox.ac.uk/people/andy.twigg/
> [email protected] | +447799647538
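For anyone picking up the idea, here is a rough sketch of what constraints (c) and (d) in Ted's list might look like in practice. It is only an illustration under stated assumptions: it does not touch Lucene or Mahout at all, each document is modelled as a plain dict whose text fields map term to frequency, and the names `index_to_matrix`, `id_field` and `text_fields` are made up for this example. It takes the "single row of a single matrix" option from (c) and the "named unique id field" option from (d).

```python
def index_to_matrix(docs, id_field="id", text_fields=("body",)):
    """Turn documents into sparse rows plus a row -> document id mapping.

    Assumption: each doc is a dict; every text field holds {term: frequency}.
    This stands in for term vectors read from an index; it is not a Lucene API.
    """
    term_index = {}  # (field, term) -> column number
    rows = []        # one sparse row per document: {column: weight}
    row_ids = []     # row number -> unique document id (constraint d)

    for doc in docs:
        row = {}
        for field in text_fields:
            for term, freq in doc.get(field, {}).items():
                # Key columns by (field, term) so that several text fields
                # can share one row without colliding (constraint c).
                col = term_index.setdefault((field, term), len(term_index))
                row[col] = row.get(col, 0.0) + float(freq)
        rows.append(row)
        row_ids.append(doc[id_field])

    return rows, row_ids, term_index


if __name__ == "__main__":
    docs = [
        {"id": "a", "body": {"lucene": 2, "index": 1}},
        {"id": "b", "title": {"index": 1}, "body": {"index": 3}},
    ]
    rows, row_ids, term_index = index_to_matrix(docs, text_fields=("title", "body"))
    for row_number, row in enumerate(rows):
        print(row_ids[row_number], sorted(row.items()))
```

Keeping `row_ids` next to the rows is what lets you get from a row of the matrix back to the document. Numeric fields (b) and named vectors (e) are left out of the sketch on purpose.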
