These are good questions. I see the best course as answering these kinds of questions in phases.
First, the only thing that is working right now is the current text => vector pipeline. We should continue to refine it with alternative forms of vectorization (random indexing and stochastic projection, as well as the current dictionary approach).

A second step is to be able to store and represent more general documents, similar to what is possible with Lucene. This is critically important for some of the things that I want to do, where I need to store and segregate title, publisher, authors, abstracts, and body text (and many other characteristics ... we probably have >100 of them). It is also critically important if we want to embrace the dualism between recommendation and search. Representing documents can be done without discarding the simpler approach we have now, and it can be done in advance of good vectorization of these complex documents.

A third step is to define advanced vectorization for complex documents. As an interim step, we can simply vectorize using the dictionary and alternative vectorizers that we have now, applied to a single field of the document. Shortly, though, we should be able to define cross-occurrence features for a multi-field vectorization.

The only dependencies here are that the third step depends on the first and second.

You have been working on the dictionary vectorizer. I did a bit of work on stochastic projection with some cooccurrence. In parallel, Drew and I have been building an Avro document schema, which drives step 2 forward. I think that will actually bear some fruit quickly.

Once that is done, we should merge capabilities. I am hoping that the good momentum that you have established on (1) will mean that merging your vectorization with the complex documents will be relatively easy. Is that a workable idea?

On Thu, Feb 4, 2010 at 10:45 AM, Robin Anil <robin.a...@gmail.com> wrote:

> And how does it work with our sequence file format (string docid => string document)? All we have is text => text?
> And finally, it's all vectors. How does the same word in two different fields translate into a vector?
>
> If you have a clear plan, let's do it, or let's do the first version with just:
>
>     document -> analyzer -> token array -> vector
>                          |-> ngram -> vector

--
Ted Dunning, CTO DeepDyve
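The pipeline Robin sketches above (document -> analyzer -> token array -> vector, with an ngram branch), plus the dictionary and random-indexing/stochastic-projection alternatives discussed earlier, can be illustrated with a minimal, self-contained sketch. All class and method names here are illustrative assumptions for discussion, not actual Mahout or Lucene APIs:

```java
import java.util.*;

// Hypothetical names, not Mahout code: a toy end-to-end vectorization pipeline.
public class VectorizationSketch {

    // Stand-in for a real analyzer: lowercase and split on whitespace.
    static List<String> analyze(String doc) {
        return Arrays.asList(doc.toLowerCase().split("\\s+"));
    }

    // Generate word ngrams from the token array (the "|-> ngram" branch).
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join("_", tokens.subList(i, i + n)));
        }
        return out;
    }

    // Dictionary vectorizer: each known term owns a fixed dimension;
    // the vector holds term counts.
    static double[] dictionaryVectorize(List<String> tokens, Map<String, Integer> dict) {
        double[] v = new double[dict.size()];
        for (String t : tokens) {
            Integer idx = dict.get(t);
            if (idx != null) v[idx] += 1.0;
        }
        return v;
    }

    // Random-indexing style vectorizer: each term is deterministically hashed
    // to a few +/-1 positions in a fixed-dimensional space, so no dictionary
    // is needed and dimensionality stays bounded.
    static double[] randomIndexVectorize(List<String> tokens, int dim, int probes) {
        double[] v = new double[dim];
        for (String t : tokens) {
            Random r = new Random(t.hashCode()); // deterministic per-term seed
            for (int p = 0; p < probes; p++) {
                int idx = r.nextInt(dim);
                v[idx] += r.nextBoolean() ? 1.0 : -1.0;
            }
        }
        return v;
    }

    public static void main(String[] args) {
        List<String> tokens = analyze("the quick brown fox the fox");

        Map<String, Integer> dict = new HashMap<>();
        dict.put("the", 0); dict.put("quick", 1); dict.put("brown", 2); dict.put("fox", 3);

        // prints [2.0, 1.0, 1.0, 2.0]
        System.out.println(Arrays.toString(dictionaryVectorize(tokens, dict)));
        // prints [the_quick, quick_brown, brown_fox, fox_the, the_fox]
        System.out.println(ngrams(tokens, 2));
        // fixed 8-dimensional vector regardless of vocabulary size
        System.out.println(randomIndexVectorize(tokens, 8, 2).length);
    }
}
```

A multi-field document (title, authors, abstract, body) would run each field through this same path separately, which is where the question of the same word in two different fields arises: under the dictionary approach each (field, term) pair would need its own dimension, while under random indexing the field name can simply be folded into the per-term seed.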