[ https://issues.apache.org/jira/browse/MAHOUT-7?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12572128#action_12572128 ]

Paul Elschot commented on MAHOUT-7:
-----------------------------------

{quote}... normalizations should be quite simple once we have a matrix lib in 
place. Seems like it'd be better to keep the interface simple and do 
normalizations on the mahout end. Any benefit I'm not thinking of?
{quote}

Assuming sparse vector access is fast, there is no benefit. With term vectors 
it is easy to do on the Lucene side, that's all.
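
To illustrate how little is needed on the Mahout end, here is a minimal sketch 
of L2 normalization over a sparse row. It assumes a row is just a mapping from 
term to count; the name l2_normalize is made up for this example and is not 
part of any Mahout or Lucene API.

```python
import math


def l2_normalize(row):
    """Return a copy of a sparse row (term -> count) scaled to unit length."""
    norm = math.sqrt(sum(v * v for v in row.values()))
    if norm == 0.0:
        return dict(row)  # empty or all-zero row: nothing to scale
    return {term: v / norm for term, v in row.items()}


# Hypothetical term counts for one document.
row = {"lucene": 3, "matrix": 4}
unit = l2_normalize(row)
```

The cost is one pass over the nonzero entries, which is why it only pays off 
to do this on the Lucene side if sparse access on the Mahout side is slow.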

{quote}
Document subset seems possibly valuable, esp. when developing code and testing 
w/ a toy-sized data set. Would need a way to specify the subset---probably w/ a 
query, no? User would need to flag the documents upon insertion into the index.
{quote}

Selecting a document subset with a query is indeed the straightforward way. A 
Lucene Filter would be good when the flag(s) are not in the Lucene index. 
Perhaps Filter should be the only way, as Lucene has a QueryFilter that can 
be used to make a Filter out of a query.
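
A rough sketch of the idea, using a toy in-memory dict in place of an index. 
Everything here (docs, query_filter, rows) is hypothetical and only mimics the 
shape of the approach: a filter is a set of document ids, and a query is 
turned into such a set before the rows are produced.

```python
# Toy stand-in for an index: doc id -> {field name -> list of terms}.
docs = {
    0: {"body": ["lucene", "index"], "subset": ["toy"]},
    1: {"body": ["mahout", "matrix"]},
    2: {"body": ["lucene", "matrix"], "subset": ["toy"]},
}


def query_filter(docs, field, term):
    """Turn a single-term query into a filter: the set of matching doc ids."""
    return {doc_id for doc_id, fields in docs.items()
            if term in fields.get(field, [])}


def rows(docs, field, doc_filter=None):
    """Yield (doc id, term counts) for one field, restricted by an optional filter."""
    for doc_id in sorted(docs):
        if doc_filter is not None and doc_id not in doc_filter:
            continue
        counts = {}
        for term in docs[doc_id].get(field, []):
            counts[term] = counts.get(term, 0) + 1
        yield doc_id, counts


# Documents flagged at insertion time, selected through a query.
toy = query_filter(docs, "subset", "toy")
subset_rows = dict(rows(docs, "body", toy))
```

When the flags are not in the index, the same rows function would work with 
any externally supplied set of ids, which is the argument for making the 
filter the only entry point.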

> Lucene indexes should act as matrix factories
> ---------------------------------------------
>
>                 Key: MAHOUT-7
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-7
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>
> It would be highly desirable to be able to extract virtual matrices from 
> lucene indexes.
> The factory methods that I know of would include:
> a) the factory would accept the name of a single field and the resulting 
> matrix would use document id's as row labels and terms as column labels.  The 
> values would be the term counts in the document (if available), or 1 if the 
> term is in the document, but the term frequency is not available.  This 
> implies that TermVectors could be viewed as rows of this matrix.  Columns 
> could be extracted by boolean retrieval from the index.  Retrieval from that 
> field could be considered a form of matrix-vector multiplication where the 
> vector encodes a query using the values as term boosts and the result wraps a 
> hit structure as a sparse matrix.  Matrix-matrix arithmetic with pairs of 
> this kind of matrix should yield a matrix as in (b).
> b) the factory would accept a linear combination of terms and the resulting 
> matrix would have rows which are linear combinations of the underlying 
> term-vectors (could this be done latently so computation is only on access?  
> would that help?).  Column access would be a form of retrieval (but what 
> would the semantics be?).  Matrix vector product could again be viewed as 
> retrieval, but it would probably be most useful to view the original 
> coefficients as boosts for the Lucene scoring mechanism rather than computing 
> a linear combination of scores.
> c) the factory would produce a matrix in which the rows are all documents and 
> the columns are all terms from all fields, each labeled with field and term 
> name (probably using lucene query syntax).  Rows would be the concatenation 
> of all term vectors, columns would represent retrieval on a single term.  
> Matrix vector multiplication would be general Lucene retrieval.  
> Matrix-matrix operations between lucene indexes should do something 
> interesting (A' A, for instance might compute term co-occurrence), but that 
> seems pretty hairy to specify.  Matrix-matrix operations with ordinary 
> matrices on the right might best be considered as multiple retrievals using 
> each column of the right hand matrix as query.
> d) as with (c), but with only a defined list of fields with the rest of the 
> fields not being expressed as columns.
> Issues with this API mostly center around efficiency of how to deal with 
> expressions involving indexes (should operations be eager or lazy) and 
> whether the use of multiplication as retrieval is too controversial.  An 
> alternative might be to add a query operation to the API just for indexes.
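
As a rough illustration of factory (a) in the description above, the sketch 
below treats one field as a virtual document-by-term matrix. It is not Lucene 
or Mahout code: FieldMatrix and its methods are hypothetical, the index is a 
toy dict, rows are computed lazily on access, and the matrix-vector product 
uses a plain boost-times-count sum rather than Lucene scoring.

```python
from collections import Counter, defaultdict


class FieldMatrix:
    """Hypothetical virtual matrix over one field: rows are docs, columns are terms."""

    def __init__(self, docs, field):
        # docs: doc id -> {field name -> list of terms}; a toy stand-in for an index.
        self._docs = docs
        self._field = field

    def row(self, doc_id):
        """Term counts for one document, computed on access (lazy)."""
        return Counter(self._docs[doc_id].get(self._field, []))

    def column(self, term):
        """Doc id -> count for one term, i.e. retrieval on a single term."""
        col = {}
        for doc_id in self._docs:
            count = self.row(doc_id)[term]
            if count:
                col[doc_id] = count
        return col

    def times(self, query):
        """Matrix-vector product: query maps term -> boost, result maps doc id -> score."""
        scores = defaultdict(float)
        for term, boost in query.items():
            for doc_id, count in self.column(term).items():
                scores[doc_id] += boost * count
        return dict(scores)

    def cooccurrence(self):
        """A'A: (term, term) -> sum over documents of the product of counts."""
        result = defaultdict(int)
        for doc_id in self._docs:
            r = self.row(doc_id)
            for t1, c1 in r.items():
                for t2, c2 in r.items():
                    result[(t1, t2)] += c1 * c2
        return dict(result)


docs = {
    0: {"body": ["lucene", "lucene", "index"]},
    1: {"body": ["mahout", "matrix"]},
    2: {"body": ["lucene", "matrix"]},
}
m = FieldMatrix(docs, "body")
hits = m.times({"lucene": 1.0, "matrix": 2.0})
cooc = m.cooccurrence()
```

Nothing is materialized until a row, column, or product is requested, which 
is one possible answer to the eager-versus-lazy question; an eager variant 
would build all rows in the constructor instead.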

