Drew has an early code drop that should be posted shortly.  He has a generic
AvroWritable that can serialize anything with an appropriate schema.  That
changes your names and philosophy a bit.

Regarding n-grams, I think that will be best combined with a non-dictionary
based vectorizer because of the large implied vocabulary that would
otherwise result.  Also, in many cases vectorization and n-gram generation
is best done in the learning algorithm itself to avoid moving massive
amounts of data.  As such, vectorization will probably need to be a library
rather than a map-reduce program.


On Thu, Feb 4, 2010 at 7:49 PM, Robin Anil <robin.a...@gmail.com> wrote:

> Lets break it down into milestones. See if you agree on the following(even
> ClassNames ?)
>
> On Fri, Feb 5, 2010 at 12:27 AM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
>
> > These are good questions.  I see the best course as answering these kinds
> > of
> > questions in phases.
> >
> > First, the only thing that is working right now is the current text =>
> > vector stuff.  We should continue to refine this with alternative forms
> of
> > vectorization (random indexing, stochastic projection as well as the
> > current
> > dictionary approach).
> >
> > The input all these vectorization job is StucturedDocumentWritable format
> which you and Drew will work on(Avro based)
>
> To create the StructuredDocumentWritable format we have to write Mapreduces
> which will convert
> a) SequenceFile => SingleField token array using Analyzer
>     I am going with simple Document
> => StucturedDocumentWritable(encapsulating StringTuple)      in   M1.
>     Change it to StucturedDocumentWritable(                     in     M2
> b) Lucene Repo  => StucturedDocumentWritable                       M2
> c) Structured XML =>  StucturedDocumentWritable                  M2
> d) Other Formats/DataSources(RDBMS)  => StucturedDocumentWritable
>         M3
>
> Jobs using StructuredDocumentWritable
> a) DictionaryVectorizer -> Makes VectorWritable                     M1
> b) nGram Generator -> Makes ngrams ->
>          1) Appends to the dictionary -> Creates Partial Vectors -> Merges
> with vectors from Dictionary Vectorizer to create ngram based vectors
> M1
>          2) Appends to  other vectorizers(random indexing, stochastic) M1?
> or M2
> c) Random Indexing Job -> Makes VectorWritable  M1? or M2
> d) StochasticProjection Job -> Makes Vector writable  M1? or M2
>
>
> How does this sound ? Feel free to edit/reorder them
>
>
>
> A second step is to be able to store and represent more general documents
> > similar to what is possible with Lucene.  This is critically important
> for
> > some of the things that I want to do where I need to store and segregate
> > title, publisher, authors, abstracts and body text (and many other
> > characteristics ... we probably have >100 of them).  It is also
> critically
> > important if we want to embrace the dualism between recommendation and
> > search.  Representing documents can be done without discarding the
> simpler
> > approach we have now and it can be done in advance of good vectorization
> of
> > these complex documents.
> >
> > A third step is to define advanced vectorization for complex documents.
>  As
> > an interim step, we can simply vectorize using the dictionary and
> > alternative vectorizers that we have now, but applied to a single field
> of
> > the document.  Shortly, though, we should be able to define cross
> > occurrence
> > features for a multi-field vectorization.
> >
> > The only dependencies here are that the third step depends on the first
> and
> > second.
> >
> > You have been working on the Dictionary vectorizer.  I did a bit of work
> on
> > stochastic projection with some cooccurrence.
> >
> > In parallel Drew and I have been working on building an Avro document
> > schema.  This is driving forward on step 2.  I think that will actually
> > bear
> > some fruit quickly.  Once that is done, we should merge capabilities.  I
> am
> > hoping that the good momentum that you have established on (1) will mean
> > that merging your vectorization with the complex documents will be
> > relatively easy.
> >
> > Is that a workable idea?
> >
> > On Thu, Feb 4, 2010 at 10:45 AM, Robin Anil <robin.a...@gmail.com>
> wrote:
> >
> > > And how does it
> > > work with our sequence file format(string docid => string document>.
> All
> > we
> > > have is text=>text ?
> > > and finally its all vectors. How does same word in two different fields
> > > translate into vector?
> > >
> > > if you have a clear plan lets do it or lets do the first version with
> > just
> > >
> > > document -> analyzer -> token array -> vector
> > >                                                      |-> ngram ->
> vector
> > >
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>



-- 
Ted Dunning, CTO
DeepDyve

Reply via email to