Re: GSOC proposals and mentors [was Call to action – Mahout needs your help]

Gokhan Capan Tue, 09 Apr 2013 10:44:18 -0700

I have an implementation of "casting" a Lucene index to a SparseRowMatrix,
with the following properties:


- Row vectors are named and labeled with a unique identifier (id)
- Column vectors are labeled with terms
- Dimensionality is numDocs x vocabularySize
- It works on StringField, too.
- It has a static creator for multiple fields, which returns an array of
  matrices.
- It doesn't support numerical fields yet.

The code is tested, and I use it for instantiating matrices from Lucene
indexes. I can submit a patch if it is desired.
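
To make the shape of the result concrete, here is a minimal sketch of the
idea in plain Java. It is not the patch itself and uses no Lucene or Mahout
classes: the class name IndexAsMatrixSketch and all of its methods are made
up for illustration, documents are assumed to arrive already tokenized, and
cell values are assumed to be raw term frequencies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy stand-in for the index-to-matrix idea; hypothetical, no Lucene or Mahout APIs. */
final class IndexAsMatrixSketch {
    // Column labels: term -> column index, assigned in first-seen order.
    private final Map<String, Integer> termToColumn = new LinkedHashMap<>();
    // Row labels: the unique document id of each row, in row order.
    private final List<String> rowLabels = new ArrayList<>();
    // Sparse rows: column index -> term frequency (an assumed weighting).
    private final List<Map<Integer, Double>> rows = new ArrayList<>();

    /** Adds one document as a new row, given its unique id and its terms. */
    public void addDocument(String id, List<String> terms) {
        Map<Integer, Double> row = new TreeMap<>();
        for (String term : terms) {
            Integer col = termToColumn.get(term);
            if (col == null) {
                col = termToColumn.size();
                termToColumn.put(term, col);
            }
            row.merge(col, 1.0, Double::sum);
        }
        rowLabels.add(id);
        rows.add(row);
    }

    public int numRows() { return rows.size(); }            // numDocs
    public int numCols() { return termToColumn.size(); }    // vocabularySize
    public String rowLabel(int row) { return rowLabels.get(row); }

    /** Returns the value at (row, term); absent entries read as 0. */
    public double get(int row, String term) {
        Integer col = termToColumn.get(term);
        return col == null ? 0.0 : rows.get(row).getOrDefault(col, 0.0);
    }

    public static void main(String[] args) {
        IndexAsMatrixSketch m = new IndexAsMatrixSketch();
        m.addDocument("doc-a", Arrays.asList("apache", "mahout", "mahout"));
        m.addDocument("doc-b", Arrays.asList("lucene", "apache"));
        System.out.println(m.numRows() + " x " + m.numCols());
        System.out.println(m.rowLabel(0) + " / mahout = " + m.get(0, "mahout"));
    }
}
```

Keeping the id as a row label rather than as the row index is what lets a
row be traced back to its document in this in-memory form.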

This is in memory, and loads the entire index into the matrix. Lately I've
decided to implement a persistent version of it, which is planned to load
from the index whenever a get request is made, and to write to the actual
index on a set request. I also plan to use the docID field, which was
attached as the row label in the previous implementation, as the actual row
index. The rest will be the same.
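
The planned get/set behavior could look roughly like the sketch below. It is
only an illustration under stated assumptions: RowStore is a hypothetical
interface standing in for the index (it is not a Lucene API), MapRowStore is
an in-memory fake so the example runs without an index, and rows are
addressed directly by docID.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical storage interface standing in for the index; not a Lucene API. */
interface RowStore {
    Map<Integer, Double> readRow(int docId);
    void writeRow(int docId, Map<Integer, Double> row);
}

/** In-memory fake store, used only so the sketch runs without an index. */
final class MapRowStore implements RowStore {
    private final Map<Integer, Map<Integer, Double>> data = new HashMap<>();

    @Override
    public Map<Integer, Double> readRow(int docId) {
        Map<Integer, Double> row = data.get(docId);
        // Return a copy so callers cannot change stored rows by accident.
        return row == null ? new HashMap<>() : new HashMap<>(row);
    }

    @Override
    public void writeRow(int docId, Map<Integer, Double> row) {
        data.put(docId, new HashMap<>(row));
    }
}

/** Matrix view that keeps no rows in memory: every call goes to the store. */
final class StoreBackedMatrixSketch {
    private final RowStore store;

    StoreBackedMatrixSketch(RowStore store) {
        this.store = store;
    }

    /** Loads the row for docId on demand and reads one cell; absent cells read as 0. */
    public double get(int docId, int column) {
        return store.readRow(docId).getOrDefault(column, 0.0);
    }

    /** Reads the row, updates one cell, and writes the row straight back. */
    public void set(int docId, int column, double value) {
        Map<Integer, Double> row = store.readRow(docId);
        row.put(column, value);
        store.writeRow(docId, row);
    }
}
```

Because nothing is cached in the matrix object, two views over the same
store see each other's writes; a real implementation would probably want to
batch reads and writes rather than touch the index on every cell.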




On Fri, Mar 29, 2013 at 3:53 AM, Ted Dunning <[email protected]> wrote:

> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
>
> a) it should be possible to get the same result as dumping the term vectors
> for each document, one per line, and converting that result using standard
> Mahout methods.
>
> b) numeric fields ought to work somehow.
>
> c) if there are multiple text fields, that ought to work sensibly as well.
> Two options are dumping multiple matrices or converting the fields
> into a single row of a single matrix.
>
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  This might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
>
> e) named vectors and matrices should be used if plausible.
>
> On Thu, Mar 28, 2013 at 4:58 PM, Dan Filimon <[email protected]> wrote:
>
> > ...
> > Ted, could you explain a bit more what you mean by "simplify the
> > connection to Lucene for clustering and classification"? It's too vague
> > for an idea proposal.
> >
>



-- 
Gokhan
