On Fri, Aug 12, 2011 at 8:04 AM, Jörn Kottmann <[email protected]> wrote:
> On 8/12/11 12:53 PM, [email protected] wrote:
>
>> Should I iterate over the training data or do it after model training?
>> I thought that not every tag would be in the outcome list because of the
>> cutoff. Also, it would be difficult to predict which tags would be in
>> the outcome list while performing cross validation, because we train
>> with a subset of the corpus.
>
> Well, there you have two points. You can try the perceptron, which is
> usually trained without a cutoff. That doesn't really help you with the
> cross validation, though. Maybe you can add a little training data to
> your corpus, so you cover all tags?

That is a good idea, but I would have to strategically distribute the
sentences around the corpus to make sure the training partition of the
cross validation uses them. I'll probably need to build a better corpus
anyway.

> If you know the tags which are causing trouble, you might just want to
> remove all tokens which contain them from your dictionary. Removing a
> few words will not make a big difference in accuracy anyway.

Is doing it during training not a good idea? I thought it would help other
people.

> Sorry for not having a better answer.
>
> Our current POS Tagger is completely statistical. To improve your
> situation we would need a hybrid approach, where it can fall back to
> some rules in case the statistical decision is not plausible according
> to a tag dict, or other rules.
>
> We also had a user here who wanted to define short sequences in a tag
> dict, to fix mistakes he observed in the output of the tagger.
>
> Maybe both things could be done for 1.6. What do you think?

Yes, a hybrid approach would add some flexibility. We can discuss it for
1.6.
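
For reference, a minimal sketch of the perceptron-without-cutoff training
mentioned above, assuming the 1.5-era POSTaggerME.train(...) signature; the
file names and the iteration count are placeholders:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSSample;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.postag.WordTagSampleStream;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class PerceptronNoCutoff {

        public static void main(String[] args) throws IOException {

            // one word_TAG annotated sentence per line
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new FileInputStream("train.pos"), "UTF-8");
            ObjectStream<POSSample> samples = new WordTagSampleStream(lines);

            // perceptron trainer with cutoff 0, so rare tags are not
            // dropped from the outcome list
            TrainingParameters params = new TrainingParameters();
            params.put(TrainingParameters.ALGORITHM_PARAM, "PERCEPTRON");
            params.put(TrainingParameters.CUTOFF_PARAM, "0");
            params.put(TrainingParameters.ITERATIONS_PARAM, "100");

            // 1.5-era signature: no tag dictionary, no ngram dictionary
            POSModel model = POSTaggerME.train("en", samples, params,
                    null, null);

            model.serialize(new FileOutputStream("en-pos-perceptron.bin"));
            samples.close();
        }
    }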

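And a sketch of removing troublesome entries from a tag dictionary, as
suggested above. This assumes a release where POSDictionary exposes
put(...); tagdict.xml and the tag XYZ are placeholders:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    import opennlp.tools.postag.POSDictionary;

    public class FilterTagDict {

        public static void main(String[] args) throws IOException {

            // load the existing tag dictionary
            POSDictionary dict = POSDictionary.create(
                    new FileInputStream("tagdict.xml"));
            POSDictionary filtered = new POSDictionary();

            String badTag = "XYZ"; // placeholder for the troublesome tag

            // copy every entry whose tag list does not contain the bad tag
            for (String word : dict) {
                String[] tags = dict.getTags(word);

                boolean keep = true;
                for (String tag : tags) {
                    if (badTag.equals(tag)) {
                        keep = false;
                        break;
                    }
                }

                if (keep) {
                    filtered.put(word, tags);
                }
            }

            filtered.serialize(new FileOutputStream("tagdict-filtered.xml"));
        }
    }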