Yeah! Tika looks great! I bet Drew's patch to create a structured document format via Avro should essentially go into Tika. Then we could really use the Tika library to the full.
I should really spend time to explore Apache projects. I think we could reuse a whole lot.

Robin

On Tue, Feb 9, 2010 at 8:30 PM, Grant Ingersoll <gsing...@apache.org> wrote:

> On Feb 9, 2010, at 9:56 AM, Robin Anil wrote:
>
> > Yeah that sounds ok. Do we have the pure content without html ?
>
> No, but I was just thinking yesterday that a really nice enhancement to the
> Doc. Vectorizer would be to hook in Tika, such that one could M/R binary
> files into Mahout vectors. Thoughts? Tika integration should be pretty
> trivial. I can likely help later in the week.
>
> -Grant