Re: Move DictionaryVectorizer to Core

Robin Anil Mon, 04 Oct 2010 15:08:26 -0700

On Tue, Oct 5, 2010 at 3:00 AM, Ted Dunning <[email protected]> wrote:


> I would like to review these changes before they happen.  I have some folks
> using these in production, and changes that surprise them would not be good.
> (Having 0.4 right now may be a godsend.)


OK.

> On Mon, Oct 4, 2010 at 12:49 PM, Robin Anil <[email protected]> wrote:
>
> > I am ready to move FeatureVectorEncoders under vectorizer.encoder.* . I
> > wanted to move them under vectorizer, but it seems these are homogeneous
> > and should be kept separate from jobs and other classes.
> >
>
> Moving them will be a momentary pain, but not so bad.
>
Cool. Check the patch.

>
>
> > I need to create a dictionary-based FeatureEncoder; for that I am thinking
> > about the following.
> >
> > I will be renaming FeatureVectorEncoder as ProbedFeatureVectorEncoder
> >
>
> I would rather HashedFeatureVectorEncoder if a rename is really necessary.
> I am not convinced that it is.
>
Yeah, HashedFeatureVectorEncoder is clearer.

>
>
> > abstract class
> > This abstract class will extend a FeatureEncoder interface having two
> > functions int encode(String) and int encode(byte[])
> >
>
> I don't understand what these do.  Are they really just a dictionary
> interface?
>
I assumed the basic encoder is one which maps a byte array to an int.  (Probing
is an overlay technique on top of it.)
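
A minimal sketch of the interface described above, to make the question
concrete. Only the two method signatures, int encode(String) and
int encode(byte[]), come from this thread; the name DictionaryFeatureEncoder,
the UTF-8 decoding, and the "next free index" policy are assumptions made
for illustration, not the committed design.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Interface as proposed in the thread: map a term to an integer index.
interface FeatureEncoder {
  int encode(String term);

  int encode(byte[] term);
}

// Hypothetical dictionary-backed implementation (an assumption, not Mahout code).
// Each previously unseen term is assigned the next free index.
class DictionaryFeatureEncoder implements FeatureEncoder {
  private final Map<String, Integer> dictionary = new HashMap<>();

  @Override
  public int encode(String term) {
    Integer index = dictionary.get(term);
    if (index == null) {
      index = dictionary.size();
      dictionary.put(term, index);
    }
    return index;
  }

  @Override
  public int encode(byte[] term) {
    // Assumes the bytes are UTF-8 encoded text.
    return encode(new String(term, StandardCharsets.UTF_8));
  }
}
```

Seen this way, the interface is essentially a dictionary lookup, which is the
point raised in the question above.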

>
>
> >
> > I will implement this interface in two FeatureEncoders: TFTextEncoder and
> > TFIDFTextEncoder
> >
>
> What about looking at the current TextValueEncoder and simply replacing the
> hash function with a dictionary lookup?  In fact, this might be done at the
> WordValueEncoder level.  setProbes would throw
> IllegalSomethingOrOtherException.


> It seems to me that a dictionary-based encoder is really no different from
> any hashed feature except that the index comes from the dictionary rather
> than from a hash function, the weight is derived from the dictionary, and
> the encoder really only supports a single probe.
>
> All of this seems doable with one or two sub-classes of the current
> TextValueEncoder.  If you want to roll in Lucene-based analysis, then
> sub-classing LuceneTextValueEncoder would be better.
>
I can proceed this way as well. I just need to move it around.  So are you
more comfortable this way, i.e. by throwing an exception from setProbes?
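
A rough sketch of the sub-classing approach discussed above, under stated
assumptions. SketchHashedEncoder and SketchDictionaryEncoder are hypothetical
stand-ins written for this example; they are not the existing
TextValueEncoder or WordValueEncoder classes, and the choice of
UnsupportedOperationException is an assumption, since the thread only says
"IllegalSomethingOrOtherException".

```java
import java.util.Map;

// Hypothetical stand-in for a hashed encoder; this is not Mahout's actual class.
class SketchHashedEncoder {
  private int probes = 1;

  public void setProbes(int probes) {
    this.probes = probes;
  }

  public int getProbes() {
    return probes;
  }

  // Maps a term and probe number to a vector index by hashing.
  protected int indexFor(String term, int probe, int numFeatures) {
    return Math.floorMod(term.hashCode() * 31 + probe, numFeatures);
  }

  protected double weightFor(String term) {
    return 1.0;
  }

  // Adds the term's weight at each probed index; negative indexes are skipped.
  public void addToVector(String term, double[] vector) {
    for (int probe = 0; probe < probes; probe++) {
      int index = indexFor(term, probe, vector.length);
      if (index >= 0) {
        vector[index] += weightFor(term);
      }
    }
  }
}

// Dictionary variant: index and weight come from the dictionary, single probe only.
class SketchDictionaryEncoder extends SketchHashedEncoder {
  private final Map<String, Integer> indexes;
  private final Map<String, Double> weights;

  SketchDictionaryEncoder(Map<String, Integer> indexes, Map<String, Double> weights) {
    this.indexes = indexes;
    this.weights = weights;
  }

  @Override
  public void setProbes(int probes) {
    throw new UnsupportedOperationException("a dictionary encoder supports a single probe");
  }

  @Override
  protected int indexFor(String term, int probe, int numFeatures) {
    Integer index = indexes.get(term);
    // Terms missing from the dictionary (or out of range) are ignored.
    return (index == null || index >= numFeatures) ? -1 : index;
  }

  @Override
  protected double weightFor(String term) {
    Double weight = weights.get(term);
    return weight == null ? 1.0 : weight;
  }
}
```

The dictionary variant overrides only the index and weight lookups and rejects
any attempt to change the probe count, so callers that never touch setProbes
see the same surface as the hashed encoder.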
