On Fri, Aug 12, 2011 at 6:18 AM, Jörn Kottmann <[email protected]> wrote:
> On 8/12/11 4:25 AM, [email protected] wrote:
>
>> If the text I am processing has any occurrence of a verb present second
>> person singular it will crash the tagger!
>>
>
> This should be fixed now, if there are any tags in the dict which are not
> maxent model outcomes, the model package validation code will fail to load
> it. So now it is at least fail fast.
>
>
>> To fix that I am thinking about optionally filter the dictionary entries
>> according to the known outcomes, that will be only available after having
>> the model trained by our training tool or by the cross validator. So after
>> training we could iterate over the entries and remove the tags that are
>> unknown by the model. But I am not sure if it is the best approach.
>>
> You can easily iterate over the training data, and create a set which
> contains all tags which are in the model and then use this set to
> create/filter your tag dict.
>

Should I iterate over the training data, or do the filtering after the model has been trained? I thought that not every tag would end up in the outcome list because of the cutoff. It would also be difficult to know in advance which tags will be in the outcome list during cross validation, because each fold is trained on a subset of the corpus.
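
To make the post-training option concrete, here is a minimal sketch of the filtering step I have in mind. It uses plain Java collections rather than the OpenNLP classes: `TagDictFilter`, `filterByOutcomes`, and the map-based dictionary are hypothetical stand-ins, and the outcome set is assumed to have been read from the trained model.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of OpenNLP.
class TagDictFilter {

    // Returns a copy of the dictionary that keeps only the tags present in
    // modelOutcomes. Words left without any known tag are dropped entirely.
    static Map<String, List<String>> filterByOutcomes(
            Map<String, List<String>> tagDict, Set<String> modelOutcomes) {
        Map<String, List<String>> filtered = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : tagDict.entrySet()) {
            List<String> kept = new ArrayList<>();
            for (String tag : entry.getValue()) {
                if (modelOutcomes.contains(tag)) {
                    kept.add(tag);
                }
            }
            if (!kept.isEmpty()) {
                filtered.put(entry.getKey(), kept);
            }
        }
        return filtered;
    }
}
```

During cross validation this would have to run once per fold, using the outcomes of the model trained on that fold, since each fold may know a different set of tags.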
