Re: Document Clustering

Grant Ingersoll Thu, 28 May 2009 08:29:07 -0700

It sounds like a start. Can you open a JIRA and attach a patch? I still am not sure if Lucene is totally the way to go on it. I suppose eventually we need a way to put things in a common format like ARFF and then just have transformers to it from other formats. Come to think of it, maybe it makes sense to have a Tika ContentHandler that can output ARFF or whatever other format we want. This would make translating input docs dead simple.
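
As a rough illustration of the "common format" idea, here is a minimal sketch that writes sparse term vectors out as sparse ARFF. It is plain Python, not a Tika ContentHandler and not Mahout code. The function name to_sparse_arff and its arguments are hypothetical, and it assumes terms are simple tokens that need no ARFF quoting.

```python
def to_sparse_arff(relation, vocab, vectors):
    """Render sparse vectors (dicts of term -> weight) as a sparse ARFF string.

    Hypothetical helper for illustration only. `vocab` fixes the attribute
    order; every term in `vectors` must appear in `vocab`.
    """
    forbidden = set(" \t\r\n'\",{}%")
    for term in vocab:
        # Assumption: terms are plain tokens, so no quoting is attempted.
        if not term or forbidden & set(term):
            raise ValueError("term needs ARFF quoting: %r" % term)

    index = {term: i for i, term in enumerate(vocab)}
    lines = ["@relation " + relation, ""]
    for term in vocab:
        lines.append("@attribute " + term + " numeric")
    lines += ["", "@data"]
    for vec in vectors:
        # Sparse ARFF rows list only non-zero cells as "<index> <value>",
        # with zero-based attribute indices in ascending order.
        cells = sorted((index[t], w) for t, w in vec.items() if w != 0.0)
        lines.append("{" + ", ".join("%d %.6g" % (i, w) for i, w in cells) + "}")
    return "\n".join(lines) + "\n"
```

Any transformer from another input format would then only need to produce the vocabulary and the per-document dicts.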

Then again, maybe a real Pipeline is the answer. I know Solr, etc. could benefit from one too, but that is a whole different ball of wax.



On May 28, 2009, at 10:32 AM, Shashikant Kore wrote:

Hi Grant,

I have the code to create a Lucene index from document text and then
generate document vectors from it.  This is stand-alone code and not
MR.  Is it something that interests you?

--shashi
On Thu, May 28, 2009 at 5:57 PM, Grant Ingersoll <[email protected]> wrote:
I'm about to write some code to prepare docs for clustering and I know at least a few others on the list here have done the same. I was wondering if anyone is in the position to share their code and contribute to Mahout.
As I see it, we need to be able to take in text and create the matrix of terms, where each cell is the TF/IDF (or some other weight, would be nice to be pluggable) and then normalize the vector (and, according to Ted, we should support using different norms). Seems like we also need the label stuff in place (https://issues.apache.org/jira/browse/MAHOUT-65) but I'm not sure on the state of that patch.
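
The weighting and normalization step described above can be sketched roughly as follows. This is a single-machine Python illustration, not Mahout code and not an M/R job. The names tfidf_vectors, default_weight and normalize are hypothetical, and the default weight tf * ln(N / df) is just one common TF/IDF variant picked as an assumption.

```python
import math
from collections import Counter


def default_weight(tf, df, n_docs):
    # Assumed variant: raw term frequency times ln(N / df).
    return tf * math.log(n_docs / df)


def normalize(vec, p=2.0):
    """Scale a sparse vector (dict) to unit length under the L-p norm."""
    if p == math.inf:
        length = max((abs(v) for v in vec.values()), default=0.0)
    else:
        length = sum(abs(v) ** p for v in vec.values()) ** (1.0 / p)
    if length == 0.0:
        return dict(vec)  # leave all-zero vectors untouched
    return {t: v / length for t, v in vec.items()}


def tfidf_vectors(docs, weight=default_weight, norm=2.0):
    """docs is a list of token lists; returns (vocab, list of sparse dicts).

    `weight` is the pluggable cell weight and `norm` selects the p-norm
    used for normalization (1, 2, math.inf, ...).
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    vocab = sorted(df)
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vec = {t: weight(c, df[t], n_docs) for t, c in tf.items()}
        vectors.append(normalize(vec, norm))
    return vocab, vectors
```

Passing a different weight function or a different value for norm swaps the cell weight or the norm without touching the rest of the code.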
As for the TF/IDF stuff, we sort of have it via the BayesTfIdfDriver, but it needs to be more generic. I realize we could use Lucene, but having a solution that scales w/ Lucene is going to take work, AIUI, whereas a M/R job seems more straightforward.
I'd like to be able to get this stuff committed relatively soon and have the examples for other people. My shorter term goal is I'm working on some demos using Wikipedia.

Thanks,
Grant


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:

http://www.lucidimagination.com/search
