Re: GSOC proposals and mentors [was Call to action – Mahout needs your help]

Gokhan Capan Tue, 09 Apr 2013 10:45:17 -0700

Correction: row vectors are named and labeled with the unique identifier field
of the index, which is defined by the client.
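
To make the listed properties concrete, here is a minimal sketch in plain Python. It is not the patch itself and uses no Lucene or Mahout API: the "index" is just a list of dicts, and the names index_to_matrix, id_field, text_field, "sku" and "body" are hypothetical, chosen only for illustration. It shows rows labeled by a client-chosen unique id field, columns labeled by terms, and a numDocs by vocabularySize shape.

```python
from collections import Counter


def index_to_matrix(docs, id_field, text_field):
    """Build a sparse term-count matrix from a list of dict 'documents'.

    Returns (row_labels, col_labels, rows), where rows[i] maps a column
    index to the term count for the i-th document.
    """
    # Column labels: the sorted vocabulary of the chosen text field.
    vocab = sorted({t for d in docs for t in d[text_field].lower().split()})
    col_index = {term: j for j, term in enumerate(vocab)}

    row_labels, rows = [], []
    for d in docs:
        # Row label: the value of the client-chosen unique id field.
        row_labels.append(d[id_field])
        counts = Counter(d[text_field].lower().split())
        # Store only non-zero entries, as a sparse row would.
        rows.append({col_index[t]: c for t, c in counts.items()})
    return row_labels, vocab, rows


# Toy stand-in for an index: two documents with a unique "sku" field.
docs = [
    {"sku": "a-1", "body": "red apple red"},
    {"sku": "b-2", "body": "green apple"},
]
row_labels, col_labels, rows = index_to_matrix(docs, "sku", "body")
```

Calling the function once per text field would give one matrix per field, which roughly corresponds to the "array of matrices" creator mentioned in the quoted message.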



On Tue, Apr 9, 2013 at 8:43 PM, Gokhan Capan <[email protected]> wrote:

> I have an implementation of "casting" a Lucene index to a SparseRowMatrix,
> with the following properties:
>
> - Row vectors are named and labeled with unique identifier id
> - Column vectors are labeled with terms
> - Dimensionality is numDocs * vocabularySize
> - It works on StringField, too.
> - It has a static creator for multiple fields, which returns an array of
>   matrices.
> - It doesn't support numeric fields yet.
>
> The code is tested, and I use it for instantiating matrices from Lucene
> indexes. I can submit a patch if it is desired.
>
> This version is in memory and loads the entire index into the matrix. I have
> recently decided to implement a persistent version, which will read from the
> index whenever a get request is made and write to the actual index on a set
> request. I also plan to use the docID field, which was attached as the row
> label in the previous implementation, as the actual row index. The rest will
> be the same.
>
>
>
>
> On Fri, Mar 29, 2013 at 3:53 AM, Ted Dunning <[email protected]> wrote:
>
>> It should be possible to view a Lucene index as a matrix.  This would
>> require that we standardize on a way to convert documents to rows.  There
>> are many choices, the discussion of which should be deferred to the actual
>> work on the project, but there are a few obvious constraints:
>>
>> a) it should be possible to get the same result as dumping the term vectors
>> for each document, one per line, and converting that result using standard
>> Mahout methods.
>>
>> b) numeric fields ought to work somehow.
>>
>> c) if there are multiple text fields, that ought to work sensibly as well.
>> Two options are dumping multiple matrices or converting the fields
>> into a single row of a single matrix.
>>
>> d) it should be possible to refer back from a row of the matrix to find the
>> correct document.  This might be because we remember the Lucene doc number
>> or because a field is named as holding a unique id.
>>
>> e) named vectors and matrices should be used if plausible.
>>
>> On Thu, Mar 28, 2013 at 4:58 PM, Dan Filimon <[email protected]> wrote:
>>
>> > ...
>> > Ted, could you explain a bit more what you mean by "simplify the connection
>> > to Lucene for clustering and classification"? It's too vague for an idea
>> > proposal.
>> >
>>
>
>
>
> --
> Gokhan
>



-- 
Gokhan
