[jira] Created: (MAHOUT-7) Lucene indexes should act as matrix factories

Ted Dunning (JIRA) Sun, 24 Feb 2008 11:00:03 -0800

Lucene indexes should act as matrix factories
---------------------------------------------


                 Key: MAHOUT-7
                 URL: https://issues.apache.org/jira/browse/MAHOUT-7
             Project: Mahout
          Issue Type: New Feature
            Reporter: Ted Dunning



It would be highly desirable to be able to extract virtual matrices from Lucene 
indexes.

The factory methods that I know of would include:

a) the factory would accept the name of a single field and the resulting matrix 
would use document IDs as row labels and terms as column labels.  The values 
would be the term counts in the document (if available), or 1 if the term is in 
the document but the term frequency is not available.  This implies that 
TermVectors could be viewed as rows of this matrix.  Columns could be extracted 
by boolean retrieval from the index.  Retrieval from that field could be 
considered a form of matrix-vector multiplication where the vector encodes a 
query using the values as term boosts and the result wraps a hit structure as a 
sparse matrix.  Matrix-matrix arithmetic with pairs of this kind of matrix 
should yield a matrix as in (b).

b) the factory would accept a linear combination of terms and the resulting 
matrix would have rows which are linear combinations of the underlying 
term vectors (could this be done lazily so computation happens only on access?  
would that help?).  Column access would be a form of retrieval (but what would 
the semantics be?).  Matrix-vector product could again be viewed as retrieval, 
but it would probably be most useful to view the original coefficients as 
boosts for the Lucene scoring mechanism rather than computing a linear 
combination of scores.

c) the factory would produce a matrix in which the rows are all documents and 
the columns are all terms from all fields, each labeled with field and term 
name (probably using Lucene query syntax).  Rows would be the concatenation of 
all term vectors; columns would represent retrieval on a single term.  
Matrix-vector multiplication would be general Lucene retrieval.  Matrix-matrix 
operations between Lucene indexes should do something interesting (A' A, for 
instance, might compute term co-occurrence), but that seems pretty hairy to 
specify.  Matrix-matrix operations with ordinary matrices on the right might 
best be considered as multiple retrievals using each column of the right-hand 
matrix as a query.

d) as with (c), but restricted to a defined list of fields; the remaining 
fields would not be expressed as columns.
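
To make the row/column/product correspondence in (a) and the A' A remark in 
(c) concrete, here is a minimal sketch in Python.  It does not use Lucene at 
all: ToyIndex, FieldMatrix and their methods are hypothetical names invented 
for illustration, the "index" is a plain in-memory dictionary, and the product 
is a raw dot product of boosts and term counts rather than Lucene's scoring.

```python
from collections import Counter, defaultdict


class ToyIndex:
    """Hypothetical in-memory stand-in for an index (not Lucene's API)."""

    def __init__(self, docs):
        # docs: {doc_id: {field_name: text}}
        self.docs = docs

    def field_matrix(self, field):
        """Factory (a): rows are doc ids, columns are terms, values are counts."""
        return FieldMatrix({
            doc_id: Counter(fields.get(field, "").lower().split())
            for doc_id, fields in self.docs.items()
        })


class FieldMatrix:
    def __init__(self, rows):
        self.rows = rows  # {doc_id: Counter(term -> count)}
        # Inverted postings, {term: {doc_id: count}}, used for column access.
        self.postings = defaultdict(dict)
        for doc_id, counts in rows.items():
            for term, n in counts.items():
                self.postings[term][doc_id] = n

    def row(self, doc_id):
        """A row is the term vector of one document."""
        return dict(self.rows[doc_id])

    def column(self, term):
        """A column is the result of boolean retrieval on one term."""
        return dict(self.postings.get(term, {}))

    def times(self, query):
        """Matrix-vector product: query maps term -> boost, result doc_id -> score."""
        scores = defaultdict(float)
        for term, boost in query.items():
            for doc_id, n in self.postings.get(term, {}).items():
                scores[doc_id] += boost * n
        return dict(scores)

    def cooccurrence(self):
        """A' A: entry (s, t) sums count(s) * count(t) over all documents."""
        result = defaultdict(float)
        for counts in self.rows.values():
            for s, a in counts.items():
                for t, b in counts.items():
                    result[(s, t)] += a * b
        return dict(result)


index = ToyIndex({
    1: {"body": "lucene index matrix"},
    2: {"body": "matrix matrix factory"},
    3: {"body": "lucene retrieval"},
})
m = index.field_matrix("body")
print(m.row(2))                                 # term vector of document 2
print(m.column("lucene"))                       # documents containing "lucene"
print(m.times({"matrix": 2.0, "lucene": 0.5}))  # query as a boosted term vector
```

Only documents that contain at least one query term appear in the product, 
which is what makes the result look like a hit list.  A matrix as in (c) or 
(d) could be modeled the same way by labeling columns with (field, term) pairs 
instead of bare terms.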

Issues with this API mostly center on how to evaluate expressions involving 
indexes efficiently (should operations be eager or lazy?) and on whether the 
use of multiplication as retrieval is too controversial.  An alternative 
might be to add a query operation to the API just for indexes.
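
The eager-versus-lazy question can be illustrated with a small sketch, again 
in plain Python with no Lucene involved.  It assumes one possible reading of 
(b), namely a weighted sum of per-field matrices, and LazyCombination, its 
methods and the rows_computed counter are all hypothetical names used only for 
this example.

```python
from collections import Counter


class LazyCombination:
    """Hypothetical lazy view of sum(weight_i * matrix_i); rows computed on access."""

    def __init__(self, weighted_matrices):
        # weighted_matrices: list of (weight, {doc_id: {term: count}})
        self.parts = weighted_matrices
        self.rows_computed = 0  # instrumentation for the example only

    def row(self, doc_id):
        """Lazy evaluation: combine the underlying term vectors for one row."""
        self.rows_computed += 1
        combined = Counter()
        for weight, matrix in self.parts:
            for term, count in matrix.get(doc_id, {}).items():
                combined[term] += weight * count
        return dict(combined)

    def materialize(self):
        """Eager evaluation: compute every row up front."""
        doc_ids = set()
        for _, matrix in self.parts:
            doc_ids.update(matrix)
        return {doc_id: self.row(doc_id) for doc_id in sorted(doc_ids)}


title = {1: {"lucene": 1}, 2: {"matrix": 1}}
body = {1: {"lucene": 2, "index": 1}, 2: {"factory": 1}}
view = LazyCombination([(2.0, title), (1.0, body)])
```

Constructing the view does no arithmetic; view.row(1) touches only the term 
vectors of document 1, while view.materialize() pays for every row whether or 
not it is ever read.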


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
