[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

Gokhan Capan (JIRA) Mon, 14 Apr 2014 04:24:08 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968254#comment-13968254
 ]


Gokhan Capan commented on MAHOUT-1178:
--------------------------------------

The thing is, it just 'loads' a Lucene index into memory as a matrix. You 
construct a matrix with the Lucene index directory location, and that's it. So 
it is not a fix for the incremental document management issue.

The alternative approach is querying the index whenever a row/column vector or 
a cell is required. I am not sure, however, that the SolrMatrix approach is 
fast enough for that.
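
To make the difference between the two approaches concrete, here is a minimal 
sketch. It does not use Lucene, Solr, or Mahout at all: the "index" is a plain 
dictionary, and the names ToyIndex, EagerIndexMatrix and LazyIndexMatrix are 
hypothetical, made up only for this illustration. It shows why a matrix loaded 
once into memory misses documents added later, while a matrix that queries on 
access sees them (at the cost of a lookup per access).

```python
from typing import Dict

# Hypothetical stand-in for a Lucene index: doc id -> {term: weight}.
# This is an assumption for illustration, not a real Lucene structure.
ToyIndex = Dict[int, Dict[str, float]]


class EagerIndexMatrix:
    """Copies every row out of the index once, at construction time."""

    def __init__(self, index: ToyIndex) -> None:
        self._rows = {doc_id: dict(terms) for doc_id, terms in index.items()}

    def num_rows(self) -> int:
        return len(self._rows)

    def get(self, doc_id: int, term: str) -> float:
        return self._rows[doc_id].get(term, 0.0)


class LazyIndexMatrix:
    """Keeps a reference to the index and looks values up on every access."""

    def __init__(self, index: ToyIndex) -> None:
        self._index = index

    def num_rows(self) -> int:
        return len(self._index)

    def get(self, doc_id: int, term: str) -> float:
        return self._index[doc_id].get(term, 0.0)


index: ToyIndex = {0: {"lucene": 2.0, "matrix": 1.0}, 1: {"mahout": 3.0}}
eager = EagerIndexMatrix(index)
lazy = LazyIndexMatrix(index)

# A document is added after both views were constructed.
index[2] = {"solr": 1.0}

print(eager.num_rows())  # the snapshot does not include the new document
print(lazy.num_rows())   # the live view does
```

In a real setting the lazy lookup would be an index query, so its cost per 
row or cell access is the open question.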

I haven't been available lately, and now I'm reading through the changes in, 
and proposals for, Mahout's future, and trying to set up my perspective for 
Mahout2. We can probably come up with a better way of document storage (still 
Lucene/Solr based). Let me leave this as is for now, and then we can discuss 
the input formats further.

Is that OK with you?

> GSOC 2013: Improve Lucene support in Mahout
> -------------------------------------------
>
>                 Key: MAHOUT-1178
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1178
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Dan Filimon
>            Assignee: Gokhan Capan
>              Labels: gsoc2013, mentor
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document, one per line, and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields, that ought to work sensibly as well.
> Two options include dumping multiple matrices or converting the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  This might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.
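
As an illustration of constraints (c) and (d) in the description above, the 
following is a rough sketch of one possible document-to-row conversion. It is 
not taken from the attached patches and uses no Lucene or Mahout API. The 
input layout (unique id -> field -> term frequencies), the function name 
to_matrix, and the "field:term" column naming are all assumptions made for 
this example. It folds several text fields into a single row of a single 
matrix and keeps the unique id of each row so a row can be traced back to its 
document.

```python
from typing import Dict, List, Tuple

# Hypothetical input: unique document id -> field name -> term frequencies.
Docs = Dict[str, Dict[str, Dict[str, int]]]


def to_matrix(docs: Docs) -> Tuple[List[str], Dict[str, int], List[Dict[int, float]]]:
    """Convert documents into sparse rows that share one column dictionary.

    Returns (row_ids, columns, rows), where row_ids[i] is the unique id of
    the document stored in rows[i], and columns maps "field:term" to a
    column index.
    """
    columns: Dict[str, int] = {}
    row_ids: List[str] = []
    rows: List[Dict[int, float]] = []
    for doc_id in sorted(docs):
        row: Dict[int, float] = {}
        for field in sorted(docs[doc_id]):
            for term, tf in sorted(docs[doc_id][field].items()):
                # Prefixing the field keeps title terms and body terms apart.
                col = columns.setdefault(f"{field}:{term}", len(columns))
                row[col] = float(tf)
        row_ids.append(doc_id)
        rows.append(row)
    return row_ids, columns, rows


docs: Docs = {
    "doc-a": {"title": {"lucene": 1}, "body": {"lucene": 2, "index": 1}},
    "doc-b": {"title": {"mahout": 1}, "body": {"matrix": 3}},
}
row_ids, columns, rows = to_matrix(docs)

# Refer back from row 1 of the matrix to the document it came from.
print(row_ids[1], rows[1])
```

The other option mentioned in (c), one matrix per field, would amount to 
calling a conversion like this once per field with a separate column 
dictionary each time.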



--
This message was sent by Atlassian JIRA
(v6.2#6252)
