Hi Shashi,
Was wondering what you thought of my updates to MAHOUT-126?
https://issues.apache.org/jira/browse/MAHOUT-126
-Grant
On May 28, 2009, at 10:32 AM, Shashikant Kore wrote:
Hi Grant,
I have code that creates a Lucene index from document text and then
generates document vectors from it. It is stand-alone code, not
MR. Is that something that interests you?
--shashi
On Thu, May 28, 2009 at 5:57 PM, Grant Ingersoll
<[email protected]> wrote:
I'm about to write some code to prepare docs for clustering and I know
at least a few others on the list here have done the same. I was
wondering if anyone is in the position to share their code and
contribute to Mahout.
As I see it, we need to be able to take in text and create the matrix
of terms, where each cell is the TF/IDF (or some other weight; it would
be nice for this to be pluggable) and then normalize the vector (and,
according to Ted, we should support using different norms). Seems like
we also need the label stuff in place
(https://issues.apache.org/jira/browse/MAHOUT-65), but I'm not sure of
the state of that patch.
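A minimal sketch of the steps described above (term matrix, pluggable
weight, pluggable norm), in plain Python rather than Mahout code. The
names tf_idf, p_norm and build_vectors are hypothetical, and
tf * log(N / df) is only one common TF/IDF variant, assumed here for
illustration.

```python
import math
from collections import Counter


def tf_idf(tf, df, n_docs):
    # Assumed weighting: raw term frequency times log(N / df).
    return tf * math.log(n_docs / df)


def p_norm(p):
    # Returns a function that computes the L-p norm of a list of values.
    def norm(values):
        return sum(abs(v) ** p for v in values) ** (1.0 / p)
    return norm


def build_vectors(docs, weight=tf_idf, norm=p_norm(2)):
    # docs: list of token lists.
    # Returns (vocabulary, list of sparse vectors as {column: weight}).
    term_counts = [Counter(tokens) for tokens in docs]

    # Document frequency: number of documents containing each term.
    df = Counter()
    for counts in term_counts:
        df.update(counts.keys())

    vocab = {term: i for i, term in enumerate(sorted(df))}

    vectors = []
    for counts in term_counts:
        vec = {vocab[t]: weight(tf, df[t], len(docs))
               for t, tf in counts.items()}
        length = norm(list(vec.values()))
        if length > 0:
            vec = {i: w / length for i, w in vec.items()}
        vectors.append(vec)
    return vocab, vectors


docs = [
    "mahout clusters documents".split(),
    "lucene indexes documents".split(),
]
vocab, vectors = build_vectors(docs)
```

Passing weight= or norm= swaps in a different scheme, for example
build_vectors(docs, norm=p_norm(1)) for L1 normalization.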
As for the TF/IDF stuff, we sort of have it via the BayesTfIdfDriver,
but it needs to be more generic. I realize we could use Lucene, but
having a solution that scales w/ Lucene is going to take work, AIUI,
whereas an M/R job seems more straightforward.
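To illustrate why the M/R formulation is straightforward, here is a
rough sketch of the document-frequency half of TF/IDF written in
map/shuffle/reduce style. It is plain in-memory Python, not Hadoop and
not the BayesTfIdfDriver; all function names are hypothetical.

```python
from collections import defaultdict


def map_doc(doc_id, tokens):
    # Mapper: emit (term, 1) once per distinct term in a document.
    for term in set(tokens):
        yield term, 1


def shuffle(pairs):
    # Stand-in for the framework's shuffle phase: group values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_df(term, ones):
    # Reducer: the document frequency is the number of documents
    # that emitted the term.
    return term, sum(ones)


def document_frequencies(docs):
    pairs = [kv
             for doc_id, tokens in docs.items()
             for kv in map_doc(doc_id, tokens)]
    return dict(reduce_df(t, vs) for t, vs in shuffle(pairs).items())


docs = {
    "d1": ["mahout", "documents", "documents"],
    "d2": ["lucene", "documents"],
}
df = document_frequencies(docs)
```

Each mapper sees only one document and each reducer only one term, so
neither step needs the whole corpus in memory.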
I'd like to be able to get this stuff committed relatively soon and
have the examples for other people. My shorter-term goal is a set of
demos using Wikipedia, which I'm working on now.
Thanks,
Grant