Drew has an early code drop that should be posted shortly. He has a generic AvroWritable that can serialize anything with an appropriate schema. That changes your names and philosophy a bit.
Regarding n-grams, I think that will be best combined with a non-dictionary-based vectorizer because of the large implied vocabulary that would otherwise result. Also, in many cases vectorization and n-gram generation are best done in the learning algorithm itself to avoid moving massive amounts of data. As such, vectorization will probably need to be a library rather than a map-reduce program.

On Thu, Feb 4, 2010 at 7:49 PM, Robin Anil <robin.a...@gmail.com> wrote:

> Let's break it down into milestones. See if you agree on the following (even
> class names?).
>
> On Fri, Feb 5, 2010 at 12:27 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > These are good questions. I see the best course as answering these kinds
> > of questions in phases.
> >
> > First, the only thing that is working right now is the current text =>
> > vector stuff. We should continue to refine this with alternative forms of
> > vectorization (random indexing and stochastic projection, as well as the
> > current dictionary approach).
>
> The input to all these vectorization jobs is the StructuredDocumentWritable
> format, which you and Drew will work on (Avro based).
>
> To create the StructuredDocumentWritable format we have to write MapReduces
> which will convert:
> a) SequenceFile => single-field token array using Analyzer.
>    I am going with simple Document => StructuredDocumentWritable
>    (encapsulating StringTuple) in M1; change it to StructuredDocumentWritable
>    in M2.
> b) Lucene repo => StructuredDocumentWritable -- M2
> c) Structured XML => StructuredDocumentWritable -- M2
> d) Other formats/data sources (RDBMS) => StructuredDocumentWritable -- M3
>
> Jobs using StructuredDocumentWritable:
> a) DictionaryVectorizer -> makes VectorWritable -- M1
> b) nGram generator -> makes n-grams ->
>    1) Appends to the dictionary -> creates partial vectors -> merges with
>       vectors from the DictionaryVectorizer to create n-gram-based
>       vectors -- M1
>    2) Appends to other vectorizers (random indexing, stochastic) -- M1?
>       or M2
> c) Random Indexing Job -> makes VectorWritable -- M1? or M2
> d) StochasticProjection Job -> makes VectorWritable -- M1? or M2
>
> How does this sound? Feel free to edit/reorder them.
>
> > A second step is to be able to store and represent more general documents,
> > similar to what is possible with Lucene. This is critically important for
> > some of the things that I want to do, where I need to store and segregate
> > title, publisher, authors, abstracts and body text (and many other
> > characteristics ... we probably have >100 of them). It is also critically
> > important if we want to embrace the dualism between recommendation and
> > search. Representing documents can be done without discarding the simpler
> > approach we have now, and it can be done in advance of good vectorization
> > of these complex documents.
> >
> > A third step is to define advanced vectorization for complex documents. As
> > an interim step, we can simply vectorize using the dictionary and
> > alternative vectorizers that we have now, but applied to a single field of
> > the document. Shortly, though, we should be able to define cross-occurrence
> > features for a multi-field vectorization.
> >
> > The only dependencies here are that the third step depends on the first
> > and second.
> >
> > You have been working on the dictionary vectorizer. I did a bit of work on
> > stochastic projection with some cooccurrence.
> >
> > In parallel, Drew and I have been working on building an Avro document
> > schema. This is driving forward on step 2. I think that will actually bear
> > some fruit quickly. Once that is done, we should merge capabilities. I am
> > hoping that the good momentum that you have established on (1) will mean
> > that merging your vectorization with the complex documents will be
> > relatively easy.
> >
> > Is that a workable idea?
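[Editor's note] Ted's suggestion of pairing n-grams with a non-dictionary vectorizer amounts to feature hashing: each n-gram is hashed directly to a vector dimension, so no vocabulary needs to be built, stored, or shipped between jobs. A minimal sketch of the idea, not Mahout code; the function names, the crc32 hash, and the dimensionality are illustrative assumptions:

```python
import zlib
from collections import defaultdict

def ngrams(tokens, n=2):
    """Yield all word grams of size 1..n from a token list."""
    for size in range(1, n + 1):
        for i in range(len(tokens) - size + 1):
            yield " ".join(tokens[i:i + size])

def hash_vectorize(tokens, dim=2 ** 20, n=2):
    """Map each gram straight to a vector index by hashing.

    No dictionary is built, so this can run as a library call inside
    the learning job itself.  Hash collisions are accepted as noise,
    which is exactly what makes the approach dictionary-free.
    Returns a sparse {index: count} representation.
    """
    vec = defaultdict(int)
    for gram in ngrams(tokens, n):
        # crc32 is a stable stand-in for whatever hash function a real
        # implementation would choose (an assumption, not Mahout's choice)
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1
    return dict(vec)
```

Because the hashing is stateless, the same routine can be invoked from a learning algorithm's mapper, which is what addresses the concern above about moving massive amounts of data.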
> >
> > On Thu, Feb 4, 2010 at 10:45 AM, Robin Anil <robin.a...@gmail.com> wrote:
> >
> > > And how does it work with our sequence file format (string docid =>
> > > string document)? All we have is text => text?
> > > And finally it's all vectors. How does the same word in two different
> > > fields translate into a vector?
> > >
> > > If you have a clear plan, let's do it, or let's do the first version
> > > with just:
> > >
> > > document -> analyzer -> token array -> vector
> > >                                  |-> ngram -> vector
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve

--
Ted Dunning, CTO
DeepDyve
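[Editor's note] On Robin's question of how the same word in two different fields translates into a vector: one standard answer is to qualify each token with its field name before lookup, so that "apple" in the title and "apple" in the body become distinct features. A hypothetical sketch; the prefix convention and helper names are assumptions, not an existing Mahout API:

```python
def field_features(doc):
    """Expand a {field: [tokens]} document into field-qualified features.

    "title:apple" and "body:apple" are distinct strings, so the same
    word in two fields lands in two different vector dimensions.
    """
    for field, tokens in doc.items():
        for tok in tokens:
            yield "%s:%s" % (field, tok)

def vectorize(doc, dictionary):
    """Dictionary-based vectorization over field-qualified features.

    `dictionary` maps a feature string to a dimension index; features
    missing from it are skipped here (a hashed vectorizer would simply
    hash them instead).  Returns a sparse {index: count} dict.
    """
    vec = {}
    for feat in field_features(doc):
        if feat in dictionary:
            idx = dictionary[feat]
            vec[idx] = vec.get(idx, 0) + 1
    return vec
```

The same field-prefix trick also composes with cross-occurrence features for multi-field vectorization, since each feature string already carries its field of origin.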