Re: Document Clustering

Shashikant Kore Sun, 14 Jun 2009 05:22:40 -0700

Hi Grant,

Apologies for not responding to your patch and comments.

I will take a look at it and send in my feedback.

Regards,
--shashi


On Sat, Jun 13, 2009 at 6:13 PM, Grant Ingersoll <[email protected]> wrote:
> Hi Shashi,
>
> Was wondering what you thought of my updates to MAHOUT-126?
>  https://issues.apache.org/jira/browse/MAHOUT-126
>
> -Grant
>
> On May 28, 2009, at 10:32 AM, Shashikant Kore wrote:
>
>> Hi Grant,
>>
>> I have code to create a Lucene index from document text and then
>> generate document vectors from it.  This is stand-alone code and not
>> M/R.  Is it something that interests you?
>>
>> --shashi
>>
>> On Thu, May 28, 2009 at 5:57 PM, Grant Ingersoll <[email protected]> wrote:
>>>
>>> I'm about to write some code to prepare docs for clustering and I know at
>>> least a few others on the list here have done the same.  I was wondering
>>> if anyone is in a position to share their code and contribute to Mahout.
>>>
>>> As I see it, we need to be able to take in text and create the matrix of
>>> terms, where each cell is the TF/IDF (or some other weight; it would be
>>> nice for this to be pluggable) and then normalize the vector (and,
>>> according to Ted, we should support using different norms).  Seems like
>>> we also need the label stuff in place
>>> (https://issues.apache.org/jira/browse/MAHOUT-65), but I'm not sure of
>>> the state of that patch.
>>>
>>> As for the TF/IDF stuff, we sort of have it via the BayesTfIdfDriver, but
>>> it needs to be more generic.  I realize we could use Lucene, but having a
>>> solution that scales w/ Lucene is going to take work, AIUI, whereas a M/R
>>> job seems more straightforward.
>>>
>>> I'd like to be able to get this stuff committed relatively soon and have
>>> the examples for other people.  In the shorter term, I'm working on some
>>> demos using Wikipedia.
>>>
>>> Thanks,
>>> Grant
>>>
>>>
>>>
>
>
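
The vectorization discussed in the quoted thread (one weight per term per
document, then normalization of each document vector under a selectable
norm) can be sketched roughly as below. This is only an illustration in
plain, in-memory Java, not Mahout or Lucene code and not the MAHOUT-126
patch. The names TfIdfSketch, Weight, TF_IDF, norm and vectorize are all
hypothetical, and the weight tf * ln(N / df) is one common TF/IDF variant
chosen here as an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: in-memory TF/IDF vectors with a pluggable weight and p-norm.
final class TfIdfSketch {

    /** Hypothetical plug-in point so the cell weight is not hard-wired to TF/IDF. */
    interface Weight {
        double weigh(int termFreq, int docFreq, int numDocs);
    }

    /** One common TF/IDF variant (an assumption): tf * ln(N / df). */
    static final Weight TF_IDF =
        (termFreq, docFreq, numDocs) -> termFreq * Math.log((double) numDocs / docFreq);

    /** p-norm of a sparse vector: p = 1 is the L1 norm, p = 2 the Euclidean norm. */
    static double norm(Map<String, Double> vector, double p) {
        double sum = 0.0;
        for (double x : vector.values()) {
            sum += Math.pow(Math.abs(x), p);
        }
        return Math.pow(sum, 1.0 / p);
    }

    /** Turns tokenized documents into weighted sparse vectors scaled to unit p-norm. */
    static List<Map<String, Double>> vectorize(
            List<List<String>> docs, Weight weight, double p) {
        Map<String, Integer> docFreq = new HashMap<>();
        List<Map<String, Integer>> termFreqs = new ArrayList<>();

        // Pass 1: term counts per document, and the number of documents per term.
        for (List<String> doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            for (String term : tf.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
            termFreqs.add(tf);
        }

        // Pass 2: weight every cell, drop zero weights, then normalize each row.
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> tf : termFreqs) {
            Map<String, Double> vec = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double w = weight.weigh(e.getValue(), docFreq.get(e.getKey()), docs.size());
                if (w != 0.0) {
                    vec.put(e.getKey(), w);
                }
            }
            double n = norm(vec, p);
            if (n > 0.0) {
                vec.replaceAll((t, x) -> x / n);
            }
            vectors.add(vec);
        }
        return vectors;
    }
}
```

Calling TfIdfSketch.vectorize(docs, TfIdfSketch.TF_IDF, 2.0) gives
unit-length TF/IDF vectors; passing a different lambda for the weight or
a different p swaps the weighting or the norm without touching the rest.
Everything here is held in memory, so it says nothing about the scaling
question raised in the thread.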
