[ 
https://issues.apache.org/jira/browse/MAHOUT-7?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12571968#action_12571968
 ] 

Paul Elschot commented on MAHOUT-7:
-----------------------------------

Some more functionality that would be useful here:
- feature preselection: suppress terms that occur in too few documents, so 
that only a subset of the available terms is used,
- use of a subset of the documents, for example as a training class,
- more normalization of the term vectors, by the maximum term frequency in a 
document or by the total term frequency in a document.
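
The three suggestions above can be sketched roughly as follows. This is only an illustration in plain Python, not Mahout or Lucene code: the term vectors are assumed to be available as a dict mapping document id to a dict of term counts, and the names preselect_terms, select_documents and normalize are made up for this example.

```python
from collections import Counter


def document_frequencies(term_vectors):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for counts in term_vectors.values():
        df.update(counts.keys())
    return df


def preselect_terms(term_vectors, min_docs):
    """Drop terms that occur in fewer than min_docs documents."""
    df = document_frequencies(term_vectors)
    keep = {term for term, n in df.items() if n >= min_docs}
    return {
        doc: {t: c for t, c in counts.items() if t in keep}
        for doc, counts in term_vectors.items()
    }


def select_documents(term_vectors, doc_ids):
    """Keep only the given documents, e.g. the members of a training class."""
    return {d: term_vectors[d] for d in doc_ids if d in term_vectors}


def normalize(term_vectors, by="max"):
    """Scale each term vector by its maximum ("max") or total ("total") frequency."""
    if by not in ("max", "total"):
        raise ValueError("by must be 'max' or 'total'")
    result = {}
    for doc, counts in term_vectors.items():
        if not counts:
            # An empty term vector has nothing to scale.
            result[doc] = {}
            continue
        denom = max(counts.values()) if by == "max" else sum(counts.values())
        result[doc] = {t: c / denom for t, c in counts.items()}
    return result


# Hypothetical toy data: document id -> {term: count}.
vectors = {
    0: {"apple": 2, "pear": 1},
    1: {"apple": 1, "plum": 3},
    2: {"apple": 4},
}
filtered = preselect_terms(vectors, min_docs=2)
training = select_documents(vectors, [0, 1])
by_max = normalize(vectors, by="max")
by_total = normalize(vectors, by="total")
```

In a real implementation the document frequencies would presumably come from the index itself rather than from a pass over all term vectors.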



> Lucene indexes should act as matrix factories
> ---------------------------------------------
>
>                 Key: MAHOUT-7
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-7
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>
> It would be highly desirable to be able to extract virtual matrices from 
> Lucene indexes.
> The factory methods that I know of would include:
> a) the factory would accept the name of a single field and the resulting 
> matrix would use document ids as row labels and terms as column labels.  The 
> values would be the term counts in the document (if available), or 1 if the 
> term is in the document, but the term frequency is not available.  This 
> implies that TermVectors could be viewed as rows of this matrix.  Columns 
> could be extracted by boolean retrieval from the index.  Retrieval from that 
> field could be considered a form of matrix-vector multiplication where the 
> vector encodes a query using the values as term boosts and the result wraps a 
> hit structure as a sparse matrix.  Matrix-matrix arithmetic with pairs of 
> this kind of matrix should yield a matrix as in (b).
> b) the factory would accept a linear combination of terms and the resulting 
> matrix would have rows which are linear combinations of the underlying 
> term-vectors (could this be done latently so computation is only on access?  
> would that help?).  Column access would be a form of retrieval (but what 
> would the semantics be?).  Matrix-vector product could again be viewed as 
> retrieval, but it would probably be most useful to view the original 
> coefficients as boosts for the Lucene scoring mechanism rather than computing 
> a linear combination of scores.
> c) the factory would produce a matrix in which the rows are all documents and 
> the columns are all terms from all fields, each labeled with field and term 
> name (probably using Lucene query syntax).  Rows would be the concatenation 
> of all term vectors, columns would represent retrieval on a single term.  
> Matrix-vector multiplication would be general Lucene retrieval.  
> Matrix-matrix operations between Lucene indexes should do something 
> interesting (A' A, for instance, might compute term co-occurrence), but that 
> seems pretty hairy to specify.  Matrix-matrix operations with ordinary 
> matrices on the right might best be considered as multiple retrievals using 
> each column of the right hand matrix as query.
> d) as with (c), but with only a defined list of fields with the rest of the 
> fields not being expressed as columns.
> Issues with this API mostly center around efficiency of how to deal with 
> expressions involving indexes (should operations be eager or lazy) and 
> whether the use of multiplication as retrieval is too controversial.  An 
> alternative might be to add a query operation to the API just for indexes.
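
To make the matrix view in (a) and the A' A remark in (c) a bit more concrete, here is a small sketch in plain Python. It is not a proposal for the actual API: FieldMatrix and its methods are hypothetical names, the index is stood in for by an in-memory dict of term counts, and the product is a plain dot product rather than Lucene scoring.

```python
class FieldMatrix:
    """Toy document-by-term matrix for a single field; rows are term vectors."""

    def __init__(self, term_vectors):
        # term_vectors: document id -> {term: count}
        self.rows = {doc: dict(counts) for doc, counts in term_vectors.items()}

    def row(self, doc_id):
        """A row is the term vector of one document."""
        return dict(self.rows.get(doc_id, {}))

    def column(self, term):
        """A column is the result of retrieving on one term."""
        return {d: c[term] for d, c in self.rows.items() if term in c}

    def times(self, query):
        """Matrix-vector product; query maps term -> boost.

        The sparse result maps document id -> score and omits zero scores,
        loosely mirroring a hit list.
        """
        scores = {}
        for doc, counts in self.rows.items():
            score = sum(counts[t] * b for t, b in query.items() if t in counts)
            if score != 0:
                scores[doc] = score
        return scores

    def gram(self):
        """A' A: entry (t, u) sums count[t] * count[u] over all documents."""
        result = {}
        for counts in self.rows.values():
            for t, ct in counts.items():
                for u, cu in counts.items():
                    result[(t, u)] = result.get((t, u), 0) + ct * cu
        return result


# Hypothetical toy data.
m = FieldMatrix({0: {"apple": 2, "pear": 1}, 1: {"apple": 1, "plum": 3}})
apple_column = m.column("apple")
hits = m.times({"apple": 1.0, "plum": 2.0})
cooccurrence = m.gram()
```

Everything here is computed eagerly on access. Whether a real implementation should instead build up lazy expressions is exactly the open question raised in the description.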

