I would like to review these changes before they happen.  I have some folks
using these in production, and changes that surprise them would not be good.
(Having 0.4 right now may be a godsend.)

On Mon, Oct 4, 2010 at 12:49 PM, Robin Anil <[email protected]> wrote:

> I am ready to move FeatureVectorEncoders under vectorizer.encoder.* . I
> wanted to move them under vectorizer but it seems these are homogenous and
> should be kept separate from jobs and other classes.
>

Moving them will be a momentary pain, but not so bad.


> I need to create a dictionary-based FeatureEncoder. For that, I am thinking
> about the following.
>
> I will be renaming FeatureVectorEncoder as ProbedFeatureVectorEncoder
>

I would prefer HashedFeatureVectorEncoder if a rename is really necessary. I
am not convinced that it is.


> abstract class
> This abstract class will extend a FeatureEncoder interface having two
> functions int encode(String) and int encode(byte[])
>

I don't understand what these do.  Are they really just a dictionary
interface?


>
> I will implement this interface in two FeatureEncoders: TFTextEncoder and
> TFIDFTextEncoder
>

What about looking at the current TextValueEncoder and simply replacing the
hash function with a dictionary lookup?  In fact, this might be done at the
WordValueEncoder level.  setProbes would throw
IllegalSomethingOrOtherException.

It seems to me that a dictionary-based encoder is really no different from
any hashed feature, except that the hash function is based on the dictionary
rather than a hash function, the weight is derived from the dictionary, and
the encoder really only supports a single probe.
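
To make that concrete, here is a rough sketch in plain Java.  It does not use
the real Mahout classes: SketchHashedEncoder, SketchDictionaryEncoder,
hashForProbe, weight, put and addToVector are all names made up for this
illustration, and the real encoders have different signatures.  Skipping
unknown terms and using UnsupportedOperationException are also just
assumptions on my part.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a hashed word encoder; NOT the Mahout class. */
class SketchHashedEncoder {
  private int probes = 2;

  void setProbes(int probes) { this.probes = probes; }

  int getProbes() { return probes; }

  /** Maps a term to a vector index for the given probe by hashing. */
  protected int hashForProbe(String term, int numFeatures, int probe) {
    return Math.floorMod((term + '#' + probe).hashCode(), numFeatures);
  }

  /** Weight added at each probed location; constant for the hashed case. */
  protected double weight(String term) { return 1.0; }

  /** Adds the term's weight at every probed index; negative indexes are skipped. */
  void addToVector(String term, double[] vector) {
    for (int p = 0; p < getProbes(); p++) {
      int idx = hashForProbe(term, vector.length, p);
      if (idx >= 0) {
        vector[idx] += weight(term);
      }
    }
  }
}

/** Same encoder, but index and weight come from a dictionary (assumed design). */
class SketchDictionaryEncoder extends SketchHashedEncoder {
  private final Map<String, Integer> index = new HashMap<>();
  private final Map<String, Double> weights = new HashMap<>();

  SketchDictionaryEncoder() {
    super.setProbes(1); // a dictionary entry has exactly one location
  }

  /** Registers a term with its fixed index and its weight (e.g. an IDF value). */
  void put(String term, int idx, double w) {
    index.put(term, idx);
    weights.put(term, w);
  }

  @Override
  void setProbes(int probes) {
    throw new UnsupportedOperationException("dictionary encoder has a single probe");
  }

  /** The "hash function" is just a dictionary lookup; unknown terms return -1. */
  @Override
  protected int hashForProbe(String term, int numFeatures, int probe) {
    Integer i = index.get(term);
    return (i == null || i >= numFeatures) ? -1 : i;
  }

  @Override
  protected double weight(String term) {
    return weights.getOrDefault(term, 0.0);
  }
}
```

The only things the dictionary version overrides are the index lookup, the
weight, and setProbes, which is why I think a sub-class or two is enough.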

All of this seems doable with one or two sub-classes of the current
TextValueEncoder.  If you want to roll in Lucene-based analysis, then
sub-classing LuceneTextValueEncoder would be better.
