Hi, I am thinking of adding a new feature for the POS Tagger component and I would appreciate some comments.
POS Tagger effectiveness increases a lot with a POSDictionary, but today the only option is to provide an existing one. It would be nice if we could induce the dictionary from the training data, or expand an existing dictionary with it. To activate this, the user could pass in a cutoff value: only word + tag pairs with a frequency higher than the cutoff would be added to the dictionary.

While performing cross-validation, we should keep in mind that we can only create / expand the dictionary using the training portion of the corpus.

The only problem I see now is how we should create / expand this dictionary if we are using the new Factory mechanism. One issue is that the tools cannot access the dictionary directly. Also, depending on the dictionary implementation in use, maybe the factory itself should perform the task of populating it. The base Factory implementation should implement this for the default POSDictionary.

In this case, I would add the following methods to the POSTaggerFactory:

1) expandPOSDictionary(TrainingSampleStream<POSSample> samples, Integer cutoff, boolean keepOriginal);
   This method would expand / create the dictionary using the data from samples, respecting the cutoff. The keepOriginal argument tells the implementation that it should back up the original dictionary.

2) restorePOSDictionary();
   Restores the dictionary from the backup, so that another cross-validation run can start.

What do you think? I am not sure this feature would help others. I also don't like the POSTaggerFactory taking on this responsibility, but I can't see a cleaner option right now.

Thank you,
William
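
P.S. To make the cutoff idea concrete, here is a rough, stand-alone sketch of the counting and expansion step. It deliberately does not use any OpenNLP classes: PosDictionarySketch, TaggedSentence and expand(...) are hypothetical names made up for illustration, and the dictionary is modelled as a plain map from word to its allowed tags. It also assumes the expansion returns a new dictionary instead of modifying the original in place, which is only one possible way to handle the backup question.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, stand-alone sketch. None of these names are OpenNLP API.
class PosDictionarySketch {

    /** One training sentence: tokens and tags aligned by index. */
    static final class TaggedSentence {
        final String[] tokens;
        final String[] tags;

        TaggedSentence(String[] tokens, String[] tags) {
            if (tokens.length != tags.length) {
                throw new IllegalArgumentException("tokens and tags must align");
            }
            this.tokens = tokens;
            this.tags = tags;
        }
    }

    /**
     * Returns a copy of the original dictionary, extended with every
     * word + tag pair seen MORE than cutoff times in the training data.
     * The original map is never modified.
     */
    static Map<String, Set<String>> expand(Map<String, Set<String>> original,
                                           List<TaggedSentence> training,
                                           int cutoff) {
        // Count how often each tag occurs with each word.
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (TaggedSentence s : training) {
            for (int i = 0; i < s.tokens.length; i++) {
                counts.computeIfAbsent(s.tokens[i], k -> new HashMap<>())
                      .merge(s.tags[i], 1, Integer::sum);
            }
        }

        // Deep-copy the original entries so the caller's dictionary stays intact.
        Map<String, Set<String>> expanded = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : original.entrySet()) {
            expanded.put(e.getKey(), new TreeSet<>(e.getValue()));
        }

        // Add only the pairs whose frequency is strictly above the cutoff.
        for (Map.Entry<String, Map<String, Integer>> byWord : counts.entrySet()) {
            for (Map.Entry<String, Integer> byTag : byWord.getValue().entrySet()) {
                if (byTag.getValue() > cutoff) {
                    expanded.computeIfAbsent(byWord.getKey(), k -> new TreeSet<>())
                            .add(byTag.getKey());
                }
            }
        }
        return expanded;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> original = new HashMap<>();
        original.put("dog", new TreeSet<>(Arrays.asList("NN")));

        List<TaggedSentence> training = Arrays.asList(
            new TaggedSentence(new String[] {"the", "run"}, new String[] {"DT", "NN"}),
            new TaggedSentence(new String[] {"the", "run"}, new String[] {"DT", "VB"}),
            new TaggedSentence(new String[] {"they", "run"}, new String[] {"PRP", "VB"}));

        // With cutoff 1, only pairs seen at least twice are added.
        System.out.println(expand(original, training, 1));
    }
}
```

In a cross-validation loop, something like expand(original, trainingPartOfFold, cutoff) could be called once per fold, always starting from the same untouched original. In this sketch that would make a separate restore step unnecessary, but whether it fits the Factory mechanism is exactly the open question above.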
