On Thu, Feb 4, 2010 at 10:29 PM, Drew Farris <drew.far...@gmail.com> wrote:

> On Thu, Feb 4, 2010 at 10:51 AM, Robin Anil <robin.a...@gmail.com> wrote:
>
> >>
> >> Document Directory -> Document Sequence File
> >> Document Sequence File -> Document Token Streams
> >> Document Token Streams -> Document Vectors + Dictionary
> >>
> > Ok I will work on this Job.
>
> FWIW, Ted had proposed something on the order of allowing Documents to
> have multiple named Fields, where each field has an independent token
> stream. Likewise, Document sequence files could have multiple fields
> per Document where each field is a string. What do you think about
> something like this? The documents I work with day to day in
> production are more frequently field-structured than flat, and in some
> cases fields are tokenized while others are simply untouched.
>
Tell me, what should the schema be? List<List<String>>? And how does it
work with our sequence file format (string docid => string document)? All we
have is text => text.
And in the end it's all vectors: how does the same word in two different
fields translate into a vector?
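One common answer (an assumption on my part, not something decided in this thread) is to prefix each token with its field name before the dictionary lookup, so the same word gets a separate vector dimension per field. A rough Java sketch with invented names:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: give the same token distinct vector dimensions per
// field by prefixing it with the field name before the dictionary lookup.
public class FieldTermDemo {
    // dictionary: prefixed term -> integer dimension id
    static final Map<String, Integer> dictionary = new LinkedHashMap<>();

    static int idFor(String field, String token) {
        String key = field + ":" + token;            // "title:apache" vs "body:apache"
        return dictionary.computeIfAbsent(key, k -> dictionary.size());
    }

    public static void main(String[] args) {
        int a = idFor("title", "apache");
        int b = idFor("body", "apache");
        System.out.println(a == b);                  // false: separate dimensions
    }
}
```

With a scheme like this the flat case is just the degenerate single-field one, so it would not complicate the text => text path.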

If you have a clear plan, let's do it; otherwise let's do the first version with just:

document -> analyzer -> token array -> vector
                                    |-> ngram -> vector
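As a sanity check on that first version, here is a minimal, hypothetical sketch of the chain in plain Java; the lowercase whitespace split stands in for a real Lucene Analyzer, and the "vector" is just a term-id to count map:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the proposed pipeline:
//   document -> analyzer -> token array -> vector
// The analyzer here is a toy stand-in for a real Lucene Analyzer.
public class PipelineSketch {
    static final Map<String, Integer> dictionary = new LinkedHashMap<>();

    static List<String> analyze(String document) {
        return Arrays.asList(document.toLowerCase().split("\\s+"));
    }

    static Map<Integer, Integer> vectorize(List<String> tokens) {
        Map<Integer, Integer> vector = new HashMap<>();
        for (String token : tokens) {
            // each unseen term gets the next dictionary id
            int id = dictionary.computeIfAbsent(token, t -> dictionary.size());
            vector.merge(id, 1, Integer::sum);       // term-frequency count
        }
        return vector;
    }

    public static void main(String[] args) {
        System.out.println(vectorize(analyze("Mahout makes Mahout vectors")));
    }
}
```

The ngram branch would sit between analyze() and vectorize(), emitting token pairs into the same kind of dictionary.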

> > Also partial Vector merger could be reused by colloc when creating ngram
> > only vectors. But we need to keep adding to the dictionary file. If you
> > can work on a dictionary merger + chunker, it will be great. I think we
> > can do this integration quickly.
>
> I'll take a closer look at the Dictionary code you've produced and see
> what I can come up with -- is the basic idea here to take multiple
> dictionaries with potentially overlapping ID's and merge them into a
> single dictionary? What needs to happen with regards to chunking?

Let's not have overlapping ids, otherwise it becomes a pain to merge. Have
unique ids in the sequence file, and a file with the last id used?
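For illustration, a hypothetical in-memory version of that merge (class and method names are invented): every unseen term gets the next unused id, and the highest id handed out is tracked so a later pass can continue from it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the non-overlapping merge: partial dictionaries
// are folded into one map, new terms get the next unused id, and the last
// id used is kept so it could be written out beside the sequence file.
public class DictionaryMerger {
    private final Map<String, Integer> merged = new LinkedHashMap<>();
    private int lastId = -1;   // would be persisted in the "last id used" file

    void addPartial(Iterable<String> terms) {
        for (String term : terms) {
            if (!merged.containsKey(term)) {
                merged.put(term, ++lastId);
            }
        }
    }

    int lastIdUsed() { return lastId; }
    Map<String, Integer> dictionary() { return merged; }

    public static void main(String[] args) {
        DictionaryMerger m = new DictionaryMerger();
        m.addPartial(java.util.Arrays.asList("alpha", "beta"));
        m.addPartial(java.util.Arrays.asList("beta", "gamma"));
        System.out.println(m.dictionary());   // {alpha=0, beta=1, gamma=2}
        System.out.println(m.lastIdUsed());   // 2
    }
}
```

Chunking would then just be splitting the merged map into fixed-size pieces; the id ranges stay disjoint by construction.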


Drew
