Hi Grant,

Apologies for not responding to your patch and comments. I will take a
look at it and send in my feedback.

Regards,
--shashi

On Sat, Jun 13, 2009 at 6:13 PM, Grant Ingersoll<[email protected]> wrote:
> Hi Shashi,
>
> Was wondering what you thought of my updates to MAHOUT-126?
> https://issues.apache.org/jira/browse/MAHOUT-126
>
> -Grant
>
> On May 28, 2009, at 10:32 AM, Shashikant Kore wrote:
>
>> Hi Grant,
>>
>> I have the code to create a Lucene index from document text and then
>> generate document vectors from it. This is stand-alone code and not
>> MR. Is it something that interests you?
>>
>> --shashi
>>
>> On Thu, May 28, 2009 at 5:57 PM, Grant Ingersoll <[email protected]>
>> wrote:
>>>
>>> I'm about to write some code to prepare docs for clustering and I know
>>> at least a few others on the list here have done the same. I was
>>> wondering if anyone is in the position to share their code and
>>> contribute to Mahout.
>>>
>>> As I see it, we need to be able to take in text and create the matrix
>>> of terms, where each cell is the TF/IDF (or some other weight, would be
>>> nice to be pluggable) and then normalize the vector (and, according to
>>> Ted, we should support using different norms). Seems like we also need
>>> the label stuff in place
>>> (https://issues.apache.org/jira/browse/MAHOUT-65) but I'm not sure on
>>> the state of that patch.
>>>
>>> As for the TF/IDF stuff, we sort of have it via the BayesTfIdfDriver,
>>> but it needs to be more generic. I realize we could use Lucene, but
>>> having a solution that scales w/ Lucene is going to take work, AIUI,
>>> whereas a M/R job seems more straightforward.
>>>
>>> I'd like to be able to get this stuff committed relatively soon and
>>> have the examples for other people. My shorter term goal is I'm
>>> working on some demos using Wikipedia.
>>>
>>> Thanks,
>>> Grant
