On Thu, Feb 4, 2010 at 10:51 AM, Robin Anil <robin.a...@gmail.com> wrote:
>>
>> Document Directory -> Document Sequence File
>> Document Sequence File -> Document Token Streams
>> Document Token Streams -> Document Vectors + Dictionary
>>
> Ok I will work on this Job.

FWIW, Ted had proposed something on the order of allowing Documents to
have multiple named Fields, where each field has an independent token
stream. Likewise, Document sequence files could have multiple fields
per Document, where each field is a string. What do you think about
something like this? The documents I work with day to day in production
are more often field-structured than flat, and in some cases some
fields are tokenized while others are left untouched.

> Also partial Vector merger could be reused by colloc when creating ngram
> only vectors. But we need to keep adding to the dictionary file. If you can
> work on a dictionary merger + chunker, it will be great. I think we can do
> this integration quickly

I'll take a closer look at the Dictionary code you've produced and see
what I can come up with -- is the basic idea here to take multiple
dictionaries with potentially overlapping IDs and merge them into a
single dictionary? What needs to happen with regard to chunking?

Drew
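
P.S. To make the merge question concrete, here is a rough sketch of what I understand by it, not the existing Dictionary code. The class name DictionaryMerger, the Merged holder, and the choice of plain in-memory maps (term -> id) are all assumptions made for illustration; the real job would presumably read and write sequence files instead. Each partial dictionary assigns its own IDs, so the same ID can mean different terms in different inputs. The sketch assigns fresh IDs in the merged dictionary and keeps an old-id -> new-id table per input so that partial vectors could be rewritten afterwards. Chunking is left out, since that is the part I am asking about.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; not part of Mahout. */
class DictionaryMerger {

  /** Merged dictionary plus one old-id -> new-id table per input dictionary. */
  static final class Merged {
    final Map<String, Integer> dictionary = new LinkedHashMap<>();
    final List<Map<Integer, Integer>> remaps = new ArrayList<>();
  }

  /**
   * Merges partial dictionaries (term -> id) whose id ranges may overlap.
   * A term gets the next free id the first time it is seen; later
   * occurrences of the same term reuse that id.
   */
  static Merged merge(List<Map<String, Integer>> partials) {
    Merged out = new Merged();
    for (Map<String, Integer> partial : partials) {
      Map<Integer, Integer> remap = new HashMap<>();
      for (Map.Entry<String, Integer> entry : partial.entrySet()) {
        Integer newId = out.dictionary.get(entry.getKey());
        if (newId == null) {
          // First time this term appears in any input: assign the next id.
          newId = out.dictionary.size();
          out.dictionary.put(entry.getKey(), newId);
        }
        // Record how this input's local id maps into the merged id space.
        remap.put(entry.getValue(), newId);
      }
      out.remaps.add(remap);
    }
    return out;
  }
}
```

With two inputs that both use IDs 0 and 1 for different terms, the merged dictionary ends up with one ID per distinct term, and the remap for the second input says which of its local IDs had to change. Is that roughly the behaviour you have in mind?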