Yeah! Tika looks great! I bet Drew's patch to create a structured document
format via Avro should essentially go into Tika. Then we could really use
the Tika library to the fullest.

I should really spend some time exploring Apache projects. I think we could
reuse a whole lot.
Robin



On Tue, Feb 9, 2010 at 8:30 PM, Grant Ingersoll <gsing...@apache.org> wrote:

>
> On Feb 9, 2010, at 9:56 AM, Robin Anil wrote:
>
> > Yeah, that sounds OK. Do we have the pure content without HTML?
>
> No, but I was just thinking yesterday that a really nice enhancement to the
> Doc. Vectorizer would be to hook in Tika, such that one could M/R binary
> files into Mahout vectors.  Thoughts?  Tika integration should be pretty
> trivial.  I can likely help later in the week.
>
> -Grant
