Let's break it down into milestones. See if you agree with the following (even
the class names?)

On Fri, Feb 5, 2010 at 12:27 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> These are good questions.  I see the best course as answering these kinds
> of questions in phases.
>
> First, the only thing that is working right now is the current text =>
> vector stuff.  We should continue to refine this with alternative forms of
> vectorization (random indexing, stochastic projection, as well as the
> current dictionary approach).
>
The input to all these vectorization jobs is the StructuredDocumentWritable
format, which you and Drew will work on (Avro based).

To create the StructuredDocumentWritable format we have to write MapReduce
jobs that convert:
a) SequenceFile => single-field token array using an Analyzer
   For M1 I am going with a simple Document => StructuredDocumentWritable
   (encapsulating a StringTuple); in M2, change it to the full Avro-based
   StructuredDocumentWritable. (A minimal mapper sketch follows this list.)
b) Lucene index => StructuredDocumentWritable                        M2
c) Structured XML => StructuredDocumentWritable                      M2
d) Other formats/data sources (RDBMS) => StructuredDocumentWritable  M3
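
For (a), a minimal sketch of what the M1 mapper could look like, assuming
Mahout's StringTuple, the new Hadoop mapreduce API, and a recent Lucene
analyzer API (the class name here is illustrative, not a proposal):

  import java.io.IOException;
  import java.io.StringReader;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
  import org.apache.mahout.common.StringTuple;

  public class DocumentTokenizerMapper
      extends Mapper<Text, Text, Text, StringTuple> {

    private final Analyzer analyzer = new StandardAnalyzer();

    @Override
    protected void map(Text docId, Text document, Context context)
        throws IOException, InterruptedException {
      // Run the single field through the Analyzer and collect its tokens.
      StringTuple tokens = new StringTuple();
      TokenStream stream =
          analyzer.tokenStream("body", new StringReader(document.toString()));
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      stream.reset();
      while (stream.incrementToken()) {
        tokens.add(term.toString());
      }
      stream.end();
      stream.close();
      context.write(docId, tokens);
    }
  }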

Jobs using StructuredDocumentWritable:
a) DictionaryVectorizer -> makes VectorWritable                      M1
   (a sketch of its core loop follows this list)
b) nGram generator -> makes n-grams ->
   1) appends them to the dictionary -> creates partial vectors ->
      merges them with the vectors from the DictionaryVectorizer to
      create n-gram-based vectors                                    M1
   2) feeds the other vectorizers (random indexing, stochastic
      projection)                                                    M1? or M2
c) Random indexing job -> makes VectorWritable                       M1? or M2
d) StochasticProjection job -> makes VectorWritable                  M1? or M2
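
For (a), the heart of the job is just a dictionary lookup per token. A
rough sketch, assuming the dictionary is already loaded into a Map and
using Mahout's RandomAccessSparseVector (the method itself is mine, only
for illustration):

  import java.util.Map;
  import org.apache.mahout.common.StringTuple;
  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;
  import org.apache.mahout.math.VectorWritable;

  // Turn a tokenized document into a sparse term-frequency vector.
  static VectorWritable vectorize(StringTuple tokens,
                                  Map<String, Integer> dictionary) {
    Vector vector = new RandomAccessSparseVector(dictionary.size());
    for (String token : tokens.getEntries()) {
      Integer index = dictionary.get(token);
      if (index != null) {                         // skip unknown terms
        vector.set(index, vector.get(index) + 1);  // raw term frequency
      }
    }
    return new VectorWritable(vector);
  }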


How does this sound? Feel free to edit/reorder them.



> A second step is to be able to store and represent more general documents
> similar to what is possible with Lucene.  This is critically important for
> some of the things that I want to do where I need to store and segregate
> title, publisher, authors, abstracts and body text (and many other
> characteristics ... we probably have >100 of them).  It is also critically
> important if we want to embrace the dualism between recommendation and
> search.  Representing documents can be done without discarding the simpler
> approach we have now and it can be done in advance of good vectorization
> of these complex documents.
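>
> Just to make this concrete, one possible shape for such a document as an
> Avro record, purely as an illustration (these field names are hypothetical,
> not the schema Drew and I are actually building; assumes a recent Avro
> release):
>
>   import org.apache.avro.Schema;
>   import org.apache.avro.generic.GenericData;
>   import org.apache.avro.generic.GenericRecord;
>
>   String json =
>       "{\"type\": \"record\", \"name\": \"Document\", \"fields\": ["
>       + "{\"name\": \"title\", \"type\": \"string\"},"
>       + "{\"name\": \"authors\","
>       + "  \"type\": {\"type\": \"array\", \"items\": \"string\"}},"
>       + "{\"name\": \"body\", \"type\": \"string\"}]}";
>   Schema schema = new Schema.Parser().parse(json);
>
>   // A document is then one record with an entry per field; the real
>   // schema would carry the other characteristics the same way.
>   GenericRecord doc = new GenericData.Record(schema);
>   doc.put("title", "An example title");
>   doc.put("authors", java.util.Arrays.asList("A. Author"));
>   doc.put("body", "...");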
>
> A third step is to define advanced vectorization for complex documents.  As
> an interim step, we can simply vectorize using the dictionary and
> alternative vectorizers that we have now, but applied to a single field of
> the document.  Shortly, though, we should be able to define cross-occurrence
> features for a multi-field vectorization.
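>
> One simple convention for the multi-field case (a sketch of the idea, not
> a settled design, but it also answers your question below about the same
> word in two fields) is to key the dictionary on field-qualified tokens:
>
>   // "java" in the title and "java" in the body become distinct
>   // dictionary entries and hence distinct vector dimensions.
>   String key = field + ":" + token;        // e.g. "title:java"
>   Integer index = dictionary.get(key);
>
> Cross-occurrence features could then pair keys across fields, e.g.
> "title:java" with "body:hadoop".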
>
> The only dependency here is that the third step depends on the first and
> second.
>
> You have been working on the Dictionary vectorizer.  I did a bit of work on
> stochastic projection with some co-occurrence.
>
> In parallel Drew and I have been working on building an Avro document
> schema.  This is driving forward on step 2.  I think that will actually
> bear some fruit quickly.  Once that is done, we should merge capabilities.
> I am hoping that the good momentum that you have established on (1) will
> mean that merging your vectorization with the complex documents will be
> relatively easy.
>
> Is that a workable idea?
>
> On Thu, Feb 4, 2010 at 10:45 AM, Robin Anil <robin.a...@gmail.com> wrote:
>
> > And how does it work with our sequence file format (string docid =>
> > string document)?  All we have is text => text.
> > And finally it's all vectors: how does the same word in two different
> > fields translate into a vector?
> >
> > If you have a clear plan, let's do it, or let's do the first version
> > with just:
> >
> > document -> analyzer -> token array -> vector
> >                                      |-> ngram -> vector
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
