The vector will not contain string data when passed to the classifier
itself.

The input data structure may contain strings, but the encoding step has to
convert them to numbers.  Often, the encoding is bolted so tightly onto the
actual learning algorithm that it doesn't look like a separate step, but it
works that way with all of the algorithms in Mahout.
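
To make the split between encoding and learning concrete, here is a rough
sketch in plain Python.  It is not Mahout code; the function name and the
three-word vocabulary are made up for illustration.  The input is a string,
but what would reach a classifier is only a list of numbers.

```python
from collections import Counter

def encode_with_vocabulary(text, vocabulary):
    """Turn a string into term counts, one slot per vocabulary word."""
    counts = Counter(text.lower().split())
    # Words that are not in the vocabulary are simply dropped.
    return [float(counts[word]) for word in vocabulary]

# Hypothetical fixed word list; a real one would be built from training data.
vocabulary = ["cheap", "meeting", "pills"]
vector = encode_with_vocabulary("Cheap pills cheap", vocabulary)
```

The classifier would only ever see `vector`, never the original text.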

With hash coding, btw, you do NOT have a fixed vocabulary.  You have a fixed
feature vector size and an unbounded vocabulary.
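
As a rough illustration of that point, here is a minimal hashed encoder in
plain Python.  It is not Mahout's encoder; the function name, the choice of
MD5 as the hash, and the vector size of 16 are all assumptions picked for
the sketch.  Any word at all maps to some slot, so no word list is needed.

```python
import hashlib

def hash_encode(text, num_features=16):
    """Encode a string as a fixed-size count vector without a vocabulary."""
    vector = [0.0] * num_features
    for token in text.lower().split():
        # Use a stable hash so the same word always lands in the same slot.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % num_features
        vector[index] += 1.0
    return vector

seen = hash_encode("the quick brown fox")
unseen = hash_encode("zyzzyva quokka")  # words no vocabulary anticipated
```

Both vectors have 16 entries no matter which words appear.  The price is
that different words can collide in the same slot.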

On Fri, Jul 22, 2011 at 12:01 AM, Brett Wines <[email protected]> wrote:

> Your first suggestion is referring to a VSM, a vector where each i^th
> entry is the frequency count (or length, or whatever) of the i^th word
> in the text item, I presume? And with hash-coding, you have a fixed
> vocabulary; which for whatever reasons might not be ideal. While these
> are both useful, I don't think they're exactly applicable, here (I
> should have been more specific; sorry). What would be really ideal was
> if the vector could contain string data -- is there any way to get
> around this?
>
