On Feb 9, 2010, at 9:56 AM, Robin Anil wrote:

> Yeah that sounds ok. Do we have the pure content without html?

No, but I was just thinking yesterday that a really nice enhancement to the Doc. Vectorizer would be to hook in Tika, such that one could M/R binary files into Mahout vectors. Thoughts?

Tika integration should be pretty trivial. I can likely help later in the week.

-Grant