On Tue, Oct 5, 2010 at 3:00 AM, Ted Dunning <[email protected]> wrote:
> I would like to review these changes before they happen. I have some folks
> using these in production and changes that surprise them would not be good.
> (Having 0.4 right now may be a godsend.)

OK.

> On Mon, Oct 4, 2010 at 12:49 PM, Robin Anil <[email protected]> wrote:
>
> > I am ready to move FeatureVectorEncoders under vectorizer.encoder.* . I
> > wanted to move them under vectorizer but it seems these are homogeneous
> > and should be kept separate from jobs and other classes.
>
> Moving them will be a momentary pain, but not so bad.

Cool. Check the patch.

> > I need to create a Dictionary Based FeatureEncoder; for that I am
> > thinking about the following.
> >
> > I will be renaming FeatureVectorEncoder as ProbedFeatureVectorEncoder
>
> I would rather HashedFeatureVectorEncoder if a rename is really necessary.
> I am not convinced that it is.

Yeah, HashedFeatureVectorEncoder is clearer.

> > abstract class
> > This abstract class will extend a FeatureEncoder interface having two
> > functions int encode(String) and int encode(byte[])
>
> I don't understand what these do. Are they really just a dictionary
> interface?

I assumed the basic encoder is one which maps a byte to an int. (Probing is
an overlay technique over it.)

> > I will implement this interface in two FeatureEncoders: TFTextEncoder and
> > TFIDFTextEncoder
>
> What about looking at the current TextValueEncoder and simply replacing the
> hash function with a dictionary lookup? In fact, this might be done at the
> WordValueEncoder level. setProbes would throw
> IllegalSomethingOrOtherException.
>
> It seems to me that a dictionary based encoder is really no different from
> any hashed feature except that the hash function is based on the dictionary
> rather than a hash function, the weight is derived from the dictionary and
> the encoder really only supports a single probe.
>
> All of this seems doable with one or two sub-classes of the current
> TextValueEncoder. If you want to roll in Lucene based analysis, then
> sub-classing LuceneTextValueEncoder would be better.

I can proceed this way as well. Just need to move it around. So are you more
comfortable this way, i.e. by throwing an exception with probes?
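
To make the proposal above concrete, here is a minimal sketch in plain Java of a dictionary-backed encoder that behaves like a hashed encoder restricted to a single probe. It is not Mahout code: the names TermEncoder, HashedTermEncoder, DictionaryTermEncoder and addTerm are hypothetical, the vector is a bare double[], and the choice of UnsupportedOperationException is only an assumption standing in for the unnamed exception mentioned in the thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface for illustration only; not Mahout's API.
interface TermEncoder {
    void setProbes(int probes);
    void addToVector(String term, double weight, double[] vector);
}

// Hashed variant: each term is spread over `probes` hashed slots.
final class HashedTermEncoder implements TermEncoder {
    private int probes = 1;

    @Override
    public void setProbes(int probes) {
        if (probes < 1) {
            throw new IllegalArgumentException("probes must be >= 1");
        }
        this.probes = probes;
    }

    @Override
    public void addToVector(String term, double weight, double[] vector) {
        for (int probe = 0; probe < probes; probe++) {
            // A different hash per probe; floorMod keeps the slot non-negative.
            int slot = Math.floorMod((term + "#" + probe).hashCode(), vector.length);
            vector[slot] += weight;
        }
    }
}

// Dictionary variant: the "hash function" is a dictionary lookup, the weight
// is scaled by a per-term dictionary weight, and only one probe is allowed.
final class DictionaryTermEncoder implements TermEncoder {
    private final Map<String, Integer> slots = new HashMap<>();
    private final Map<String, Double> dictionaryWeights = new HashMap<>();

    // Registers a term; the first registration fixes its slot.
    void addTerm(String term, double dictionaryWeight) {
        slots.putIfAbsent(term, slots.size());
        dictionaryWeights.put(term, dictionaryWeight);
    }

    @Override
    public void setProbes(int probes) {
        if (probes != 1) {
            // Assumed exception type; the thread leaves it unspecified.
            throw new UnsupportedOperationException(
                "dictionary encoder supports exactly one probe");
        }
    }

    @Override
    public void addToVector(String term, double weight, double[] vector) {
        Integer slot = slots.get(term);
        if (slot == null) {
            return; // terms missing from the dictionary are skipped
        }
        if (slot >= vector.length) {
            throw new IllegalArgumentException("vector is smaller than the dictionary");
        }
        vector[slot] += weight * dictionaryWeights.get(term);
    }
}
```

Under these assumptions, callers written against TermEncoder would not need to know which variant they hold; only code that asks for more than one probe would notice the difference.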
