On Thu, Feb 4, 2010 at 10:29 PM, Drew Farris <drew.far...@gmail.com> wrote:
> On Thu, Feb 4, 2010 at 10:51 AM, Robin Anil <robin.a...@gmail.com> wrote:
>>
>> Document Directory -> Document Sequence File
>> Document Sequence File -> Document Token Streams
>> Document Token Streams -> Document Vectors + Dictionary
>>
> Ok, I will work on this Job.
>
> FWIW, Ted had proposed something on the order of allowing Documents to
> have multiple named Fields, where each field has an independent token
> stream. Likewise, Document sequence files could have multiple fields
> per Document, where each field is a string. What do you think about
> something like this? The documents I work with day to day in
> production are more frequently field-structured than flat, and in some
> cases some fields are tokenized while others are simply untouched.

Tell me what the schema should be — List<List<String>>? And how does it
work with our sequence file format (string docid => string document)? All
we have is text=>text, and in the end it's all vectors. How does the same
word in two different fields translate into a vector? If you have a clear
plan, let's do it; otherwise let's do the first version with just:

document -> analyzer -> token array -> vector
                                   |-> ngram -> vector

>> Also, the partial Vector merger could be reused by colloc when creating
>> ngram-only vectors. But we need to keep adding to the dictionary file.
>> If you can work on a dictionary merger + chunker, it will be great. I
>> think we can do this integration quickly.
>
> I'll take a closer look at the Dictionary code you've produced and see
> what I can come up with -- is the basic idea here to take multiple
> dictionaries with potentially overlapping IDs and merge them into a
> single dictionary? What needs to happen with regards to chunking?

Let's not have overlapping ids; otherwise it becomes a pain to merge. Have
unique ids in the sequence file, and a file with the last id used?

> Drew
>
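To make the two questions above concrete — how the same word in two different fields maps to a vector, and how a "last id used" scheme avoids overlapping dictionary ids — here is a minimal standalone sketch. It is not Mahout code: the `Map<String, List<String>>` per-document schema, the `"field:term"` key prefixing, and the class/method names are all hypothetical illustrations of one possible design.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: field-structured documents vectorized against a
// single dictionary. Prefixing each term with its field name means the
// same word in two different fields becomes two distinct dimensions.
public class FieldVectorSketch {

    // Dictionary: term key -> dimension index. Ids are handed out
    // sequentially, so "last id used" is a single number that can be
    // persisted and picked up by the next job, avoiding overlapping ids.
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();

    private int idFor(String key) {
        return dictionary.computeIfAbsent(key, k -> dictionary.size());
    }

    // One document = field name -> token stream for that field.
    // Returns a sparse term-frequency vector as dimension -> count.
    public Map<Integer, Integer> vectorize(Map<String, List<String>> doc) {
        Map<Integer, Integer> vector = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> field : doc.entrySet()) {
            for (String token : field.getValue()) {
                int dim = idFor(field.getKey() + ":" + token);
                vector.merge(dim, 1, Integer::sum);
            }
        }
        return vector;
    }

    public int lastIdUsed() {
        return dictionary.size() - 1;
    }
}
```

With this scheme a merger never has to reconcile clashing ids: each dictionary chunk records its last id, and the next chunk starts numbering from there.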