Hi there,
Thanks, Ted; you're very helpful!
OK, w.r.t. my second question, perhaps I wasn't clear enough; I'm not
trying to cluster large text documents. Maybe it'd be better if I gave
an example?
Suppose you're clustering apples. Your feature vector for each apple might
contain things like, say, the address of the farm where the apple came from
and a person's description of the flavor of the apple ("tangy / fantastic /
eyebrow-raising"), and nothing else.
Your first suggestion is referring to a VSM, a vector where each i^th
entry is the frequency count (or length, or whatever) of the i^th word
in the text item, I presume? And with hash-coding, you have a fixed
vocabulary, which for whatever reason might not be ideal. While these
are both useful, I don't think they're exactly applicable here (I
should have been more specific; sorry). What would be really ideal is
if the vector could contain string data -- is there any way to get
around this?
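To make the apple example concrete, here's a rough sketch of the kind of workaround I mean (plain Java, nothing Mahout-specific; the class and method names are made up): one-hot encoding each distinct string feature into its own numeric dimension, rather than storing strings in the vector directly.

```java
import java.util.*;

// Hypothetical sketch, not Mahout API: give every distinct string
// feature (e.g. "farm:smith-orchard", "flavor:tangy") its own
// dimension, so items become 0/1 vectors that standard distance
// measures can consume.
public class OneHotEncoder {
    private final List<String> vocabulary = new ArrayList<>();
    private final Map<String, Integer> index = new HashMap<>();

    // Assign (or look up) the dimension for a string feature value.
    public int indexOf(String value) {
        Integer i = index.get(value);
        if (i == null) {
            i = vocabulary.size();
            index.put(value, i);
            vocabulary.add(value);
        }
        return i;
    }

    // Encode one item's string features as a 0/1 vector. Assumes a
    // first pass over the data has already registered all values
    // via indexOf, so the vector length is stable.
    public double[] encode(Collection<String> features) {
        double[] v = new double[vocabulary.size()];
        for (String f : features) {
            Integer i = index.get(f);
            if (i != null) {
                v[i] = 1.0;
            }
        }
        return v;
    }
}
```

This is two-pass (build the vocabulary, then encode), which is exactly the sort of bookkeeping I was hoping the library could take off my hands.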
Thanks, everyone, for taking the time to read this!
Brett
On 7/22/11, Ted Dunning <[email protected]> wrote:
> Copying dev list with Brett's permission.
>
> On Thu, Jul 21, 2011 at 8:00 PM, Brett Wines <[email protected]>wrote:
>
>>
>> I'm writing in response to a question on a Mahout
>> forum<http://search.lucidimagination.com/search/document/943a194edea159fc/string_clustering_and_other_newbie_questions>;
>> I was wondering if you could answer a question or two for me?
>>
>
> Sure.
>
>
>> First, do you know if there's a good way to plug in one's own
>> centroid-computing function for Mahout algorithms like k-means or EM?
>>
>
> Actually, I am not entirely sure. I think that there is.
>
> It is definitely true that there is a good way to plug in a new distance
> function for computing cluster membership.
>
> Jeff, is it easy to plug in a new centroid function? I think that you said
> yes to this as part of the classification/clustering unification work.
>
>> Second, do you know if there's any way at all to run Mahout clustering
>> algorithms on things where the features aren't numbers? The vectors don't
>> support anything except for doubles, and it's hacky and messy to map
>> non-numerical feature data to arbitrary numbers and then undo the mapping
>> in a custom distance function (the DistanceMeasure interface requires the
>> comparison function to take Vectors as parameters). There's got to be a
>> better solution.
>>
>
>
> Hmm... I am not clear on all of your requirements, but there are at least
> two methods for doing this.
>
> One method that is commonly used to do this is to do classic vector space
> conversion of text-like data. With this, there is one dimension in the
> feature vector per unique word. There is wide support for this with clustering
> in Mahout. This is also easy to reverse engineer, but it doesn't support
> stupendous or open-ended vocabularies.
>
> Another method is to use hash-encoding. This allows combinations of
> continuous, text-like and word-like data into a fixed size vector that is
> merely large instead of stupendous in size. This representation is nice and
> consistent, but it can be difficult to reverse engineer.
>
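For concreteness, the vector-space conversion Ted describes might be sketched like this in plain Java (a toy illustration, not Mahout's actual API; all names here are made up): one dimension per unique word, with term-frequency entries.

```java
import java.util.*;

// Sketch of classic vector-space conversion: one dimension per
// unique word across the corpus, entries are term frequencies.
// Illustrative only -- Mahout's own vectorization tooling does this
// at scale.
public class VsmSketch {
    // First pass: build a shared word -> dimension dictionary.
    public static Map<String, Integer> buildDictionary(List<String> docs) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (String doc : docs) {
            for (String word : doc.toLowerCase().split("\\s+")) {
                dict.putIfAbsent(word, dict.size());
            }
        }
        return dict;
    }

    // Second pass: encode one document as a term-frequency vector.
    public static double[] encode(String doc, Map<String, Integer> dict) {
        double[] v = new double[dict.size()];
        for (String word : doc.toLowerCase().split("\\s+")) {
            Integer i = dict.get(word);
            if (i != null) {
                v[i]++;
            }
        }
        return v;
    }
}
```

The closed dictionary is what makes this easy to reverse engineer (dimension i maps back to word i) and also what rules out open-ended vocabularies.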
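The hash-encoding idea can likewise be sketched in a few lines of plain Java (again illustrative, not Mahout's encoder classes): every feature string is hashed into a fixed number of buckets, so the vocabulary never needs to be enumerated up front.

```java
// Sketch of hashed (feature-hashing) encoding: features of any kind
// -- words, or "field=value" strings for categorical data -- are
// hashed into a fixed-size vector. Open-ended vocabularies work,
// but collisions make it hard to map dimensions back to features.
// Names are illustrative, not Mahout API.
public class HashedEncoder {
    // Map a feature string to one of `dim` dimensions.
    static int bucket(String feature, int dim) {
        return Math.floorMod(feature.hashCode(), dim);
    }

    // Encode features by incrementing each feature's hashed bucket;
    // colliding features simply add together.
    public static double[] encode(Iterable<String> features, int dim) {
        double[] v = new double[dim];
        for (String f : features) {
            v[bucket(f, dim)] += 1.0;
        }
        return v;
    }
}
```

This is why the representation is "merely large instead of stupendous": the vector length is a chosen constant, independent of how many distinct feature values ever appear.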