I would like to review these changes before they happen. I have some folks using these in production, and changes that surprise them would not be good. (Having 0.4 right now may be a godsend.)

On Mon, Oct 4, 2010 at 12:49 PM, Robin Anil <[email protected]> wrote:

> I am ready to move FeatureVectorEncoders under vectorizer.encoder.* . I
> wanted to move them under vectorizer but it seems these are homogenous and
> should be kept separate from jobs and other classes.
>

Moving them will be a momentary pain, but not so bad.

> I need to create a Dictionary Based FeatureEncoder for that I am thinking
> about the following.
>
> I will be renaming FeatureVectorEncoder as ProbedFeatureVectorEncoder
>

I would rather use HashedFeatureVectorEncoder if a rename is really necessary. I am not convinced that it is.

> abstract class
> This abstract class will extend a FeatureEncoder interface having two
> functions int encode(String) and int encode(byte[])
>

I don't understand what these do. Are they really just a dictionary interface?

>
> I will implement this interface in two FeatureEncoders: TFTextEncoder and
> TFIDFTextEncoder
>

What about looking at the current TextValueEncoder and simply replacing the hash function with a dictionary lookup? In fact, this might be done at the WordValueEncoder level. setProbes would throw an IllegalSomethingOrOtherException.

It seems to me that a dictionary-based encoder is really no different from any hashed feature, except that the index comes from the dictionary rather than a hash function, the weight is derived from the dictionary, and the encoder really only supports a single probe. All of this seems doable with one or two sub-classes of the current TextValueEncoder.

If you want to roll in Lucene-based analysis, then sub-classing LuceneTextValueEncoder would be better.
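
To make the sub-classing idea concrete, here is a rough sketch of what I have in mind. It does not use the real Mahout classes: SketchHashedEncoder and SketchDictionaryEncoder are hypothetical stand-ins, a plain double[] stands in for the vector, and the method names (indexFor, weightFor) are my own assumptions for illustration. The only point is that the dictionary version overrides the index and weight lookups and refuses anything but a single probe.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a hashed text encoder. NOT Mahout's actual API.
class SketchHashedEncoder {
    private final String name;
    private int probes = 1;

    SketchHashedEncoder(String name) {
        this.name = name;
    }

    void setProbes(int probes) {
        this.probes = probes;
    }

    int getProbes() {
        return probes;
    }

    // Maps a word to a vector slot for the given probe; -1 means "skip".
    protected int indexFor(String word, int probe, int numFeatures) {
        int h = (name + ":" + word + ":" + probe).hashCode();
        return Math.floorMod(h, numFeatures);
    }

    // Weight added to each slot the word lands in.
    protected double weightFor(String word) {
        return 1.0;
    }

    void addToVector(String word, double[] vector) {
        for (int p = 0; p < probes; p++) {
            int idx = indexFor(word, p, vector.length);
            if (idx >= 0) {
                vector[idx] += weightFor(word);
            }
        }
    }
}

// Dictionary-based variant: same shape, but the index and the weight
// both come from the dictionary, and only one probe is allowed.
class SketchDictionaryEncoder extends SketchHashedEncoder {
    private final Map<String, Integer> index;
    private final Map<String, Double> weights;

    SketchDictionaryEncoder(String name, Map<String, Integer> index,
                            Map<String, Double> weights) {
        super(name);
        this.index = index;
        this.weights = weights;
    }

    @Override
    void setProbes(int probes) {
        // Assumption: reject anything other than a single probe.
        if (probes != 1) {
            throw new IllegalArgumentException(
                "dictionary encoder supports exactly one probe");
        }
    }

    @Override
    protected int indexFor(String word, int probe, int numFeatures) {
        Integer i = index.get(word);
        // Words missing from the dictionary (or out of range) are skipped.
        return (i == null || i < 0 || i >= numFeatures) ? -1 : i;
    }

    @Override
    protected double weightFor(String word) {
        return weights.getOrDefault(word, 0.0);
    }
}

class DictionaryEncoderDemo {
    public static void main(String[] args) {
        Map<String, Integer> index = new HashMap<>();
        index.put("apple", 0);
        index.put("pear", 1);
        Map<String, Double> weights = new HashMap<>();
        weights.put("apple", 0.5);
        weights.put("pear", 2.0);

        SketchDictionaryEncoder enc =
            new SketchDictionaryEncoder("text", index, weights);
        double[] v = new double[4];
        enc.addToVector("apple", v);
        enc.addToVector("pear", v);
        enc.addToVector("unknown", v); // not in the dictionary, so ignored
        System.out.println(Arrays.toString(v));
    }
}
```

Whether unknown words should be dropped, as here, or sent to a catch-all slot is a separate decision that the sketch does not try to settle.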
