These are good questions. I see the best course as answering these kinds of questions in phases.
First, the only thing that is working right now is the current text => vector pipeline. We should continue to refine it with alternative forms of vectorization (random indexing and stochastic projection, as well as the current dictionary approach).

A second step is to be able to store and represent more general documents, similar to what is possible with Lucene. This is critically important for some of the things that I want to do, where I need to store and segregate title, publisher, authors, abstracts, and body text (and many other characteristics ... we probably have >100 of them). It is also critically important if we want to embrace the dualism between recommendation and search. Representing documents can be done without discarding the simpler approach we have now, and it can be done in advance of good vectorization of these complex documents.

A third step is to define advanced vectorization for complex documents. As an interim step, we can simply vectorize using the dictionary and alternative vectorizers that we have now, applied to a single field of the document. Shortly, though, we should be able to define cross-occurrence features for a multi-field vectorization.

The only dependencies here are that the third step depends on the first and second.

You have been working on the dictionary vectorizer. I did a bit of work on stochastic projection with some cooccurrence. In parallel, Drew and I have been building an Avro document schema, which drives step 2 forward. I think that will actually bear some fruit quickly.

Once that is done, we should merge capabilities. I am hoping that the good momentum that you have established on (1) will mean that merging your vectorization with the complex documents will be relatively easy. Is that a workable idea?

On Thu, Feb 4, 2010 at 10:45 AM, Robin Anil <robin.a...@gmail.com> wrote:

> And how does it work with our sequence file format (string docid => string document)? All we have is text => text?
> And finally, it's all vectors. How does the same word in two different fields translate into a vector?
>
> If you have a clear plan, let's do it, or let's do the first version with just:
>
>     document -> analyzer -> token array -> vector
>                          |-> ngram -> vector

--
Ted Dunning, CTO DeepDyve
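The pipeline Robin sketches above (document -> analyzer -> token array -> vector, with an ngram branch), plus the dictionary and random-indexing/stochastic-projection alternatives discussed earlier, can be illustrated with a minimal, self-contained sketch. All class and method names here are illustrative assumptions for discussion, not actual Mahout or Lucene APIs:

```java
import java.util.*;

// Hypothetical names, not Mahout code: a toy end-to-end vectorization pipeline.
public class VectorizationSketch {

    // Stand-in for a real analyzer: lowercase and split on whitespace.
    static List<String> analyze(String doc) {
        return Arrays.asList(doc.toLowerCase().split("\\s+"));
    }

    // Generate word ngrams from the token array (the "|-> ngram" branch).
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join("_", tokens.subList(i, i + n)));
        }
        return out;
    }

    // Dictionary vectorizer: each known term owns a fixed dimension;
    // the vector holds term counts.
    static double[] dictionaryVectorize(List<String> tokens, Map<String, Integer> dict) {
        double[] v = new double[dict.size()];
        for (String t : tokens) {
            Integer idx = dict.get(t);
            if (idx != null) v[idx] += 1.0;
        }
        return v;
    }

    // Random-indexing style vectorizer: each term is deterministically hashed
    // to a few +/-1 positions in a fixed-dimensional space, so no dictionary
    // is needed and dimensionality stays bounded.
    static double[] randomIndexVectorize(List<String> tokens, int dim, int probes) {
        double[] v = new double[dim];
        for (String t : tokens) {
            Random r = new Random(t.hashCode()); // deterministic per-term seed
            for (int p = 0; p < probes; p++) {
                int idx = r.nextInt(dim);
                v[idx] += r.nextBoolean() ? 1.0 : -1.0;
            }
        }
        return v;
    }

    public static void main(String[] args) {
        List<String> tokens = analyze("the quick brown fox the fox");

        Map<String, Integer> dict = new HashMap<>();
        dict.put("the", 0); dict.put("quick", 1); dict.put("brown", 2); dict.put("fox", 3);

        // prints [2.0, 1.0, 1.0, 2.0]
        System.out.println(Arrays.toString(dictionaryVectorize(tokens, dict)));
        // prints [the_quick, quick_brown, brown_fox, fox_the, the_fox]
        System.out.println(ngrams(tokens, 2));
        // fixed 8-dimensional vector regardless of vocabulary size
        System.out.println(randomIndexVectorize(tokens, 8, 2).length);
    }
}
```

A multi-field document (title, authors, abstract, body) would run each field through this same path separately, which is where the question of the same word in two different fields arises: under the dictionary approach each (field, term) pair would need its own dimension, while under random indexing the field name can simply be folded into the per-term seed.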