On Fri, Aug 12, 2011 at 8:04 AM, Jörn Kottmann <[email protected]> wrote:

> On 8/12/11 12:53 PM, [email protected] wrote:
>
>> Should I iterate over the training data or do it after model training? I
>> thought that not every tag would be in the outcome list because of the
>> cutoff. Also it would be difficult to preview which tags would be at the
>> outcome list while performing cross validation because we train with a
>> subset of the corpus.
>>
>
> Well, there you have two points. You can try the perceptron, which is
> usually trained without a cutoff. That doesn't really help you with the
> cross validation, though.
> Maybe you can add a little training data to your corpus so that you are
> covering all tags?
>

That is a good idea, but I would have to strategically distribute the
sentences around the corpus to make sure the training partition of each
cross-validation fold includes them. I'll probably need to build a better
corpus anyway.
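For reference, here is a minimal sketch of how one could check which tags risk falling under the cutoff in a given training partition. The corpus representation ("word_TAG" tokens in plain lists) and all class and method names are hypothetical, just to illustrate the idea; this is not the OpenNLP API.

```java
import java.util.*;

public class TagCoverageCheck {

    // Count how often each tag occurs in a list of tagged sentences,
    // where a sentence is a list of "word_TAG" tokens (hypothetical format).
    static Map<String, Integer> countTags(List<List<String>> sentences) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> sentence : sentences) {
            for (String token : sentence) {
                String tag = token.substring(token.lastIndexOf('_') + 1);
                counts.merge(tag, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Tags whose frequency falls below the cutoff; these are the ones
    // at risk of missing from the model's outcome list.
    static Set<String> rareTags(Map<String, Integer> counts, int cutoff) {
        Set<String> rare = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() < cutoff) rare.add(e.getKey());
        }
        return rare;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("the_DET", "dog_NN", "barks_VBZ"),
            Arrays.asList("a_DET", "cat_NN", "sleeps_VBZ"),
            Arrays.asList("wow_UH", "nice_JJ"));
        Map<String, Integer> counts = countTags(corpus);
        // Tags seen fewer than 2 times in this partition.
        System.out.println(rareTags(counts, 2));
    }
}
```

Running this per cross-validation partition would show up front which tags a fold's training data cannot cover.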

> If you know which tags are causing trouble, you might just want to remove
> all tokens from your dictionary that contain them. Removing a few words
> will not make a big difference in accuracy anyway.
>

Wouldn't it be better to do this filtering during training? I thought it
would help other people as well.
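A minimal sketch of the dictionary filtering suggested above, using a plain Map as a stand-in for the tag dictionary; the names are hypothetical and this is not the OpenNLP POSDictionary API.

```java
import java.util.*;

public class TagDictFilter {

    // Drop every word whose allowed-tag set contains one of the
    // troublesome tags (tags absent from the model's outcome list).
    static Map<String, Set<String>> filter(Map<String, Set<String>> dict,
                                           Set<String> badTags) {
        Map<String, Set<String>> result = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : dict.entrySet()) {
            if (Collections.disjoint(e.getValue(), badTags)) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> dict = new HashMap<>();
        dict.put("run", new HashSet<>(Arrays.asList("NN", "VB")));
        dict.put("wow", new HashSet<>(Arrays.asList("UH")));
        // "wow" is dropped because UH is not among the model outcomes.
        System.out.println(filter(dict, Collections.singleton("UH")).keySet());
    }
}
```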


>
> Sorry for not having a better answer.
>
> Our current POS Tagger is completely statistical. To improve your situation
> we would need a hybrid approach, where it can fall back to some rules in
> case the statistical decision is not plausible according to a tag dict, or
> other rules.
>
> We also had a user here who wanted to define short sequences in a tag dict
> to fix mistakes he observed in the output of the tagger.
>
> Maybe both things could be done for 1.6. What do you think?
>

Yes, a hybrid approach would add some flexibility. We can discuss it for
1.6.
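A rough sketch of what such a fallback could look like, assuming the model exposes per-tag probabilities for a token; all names here are hypothetical, not an existing OpenNLP API.

```java
import java.util.*;

public class HybridTagger {

    // Keep the statistical tag unless the tag dictionary forbids it for
    // this word; then fall back to the most probable allowed tag.
    static String resolve(String word, Map<String, Double> tagProbs,
                          Map<String, Set<String>> tagDict) {
        String best = Collections.max(tagProbs.entrySet(),
                Map.Entry.comparingByValue()).getKey();
        Set<String> allowed = tagDict.get(word);
        if (allowed == null || allowed.contains(best)) {
            return best; // unknown word or plausible decision: trust the model
        }
        String fallback = null;
        double bestProb = -1;
        for (String tag : allowed) {
            double p = tagProbs.getOrDefault(tag, 0.0);
            if (p > bestProb) { bestProb = p; fallback = tag; }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Map<String, Double> probs = new HashMap<>();
        probs.put("NN", 0.6);
        probs.put("VB", 0.3);
        Map<String, Set<String>> dict = new HashMap<>();
        dict.put("runs", new HashSet<>(Arrays.asList("VBZ", "VB")));
        // The model prefers NN, but the dict only allows VBZ or VB for
        // "runs", so the tagger falls back to VB.
        System.out.println(resolve("runs", probs, dict));
    }
}
```

The short-sequence rules mentioned above could slot in at the same point, overriding the statistical decision before it is emitted.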
