Hi, Svetoslav,

On Fri, Jan 27, 2012 at 1:04 PM, Svetoslav Marinov <
[email protected]> wrote:

> Hi all,
>
> I have asked this question earlier in another thread but did not get
> answer.
>
> I would like to train new models for Swedish for sentence detection,
> tokenization, POStagging, NER since the existing models seem to perform
> poorly on my data.
>
> I read the documentation and I surely followed the steps for training a
> model from command line or the API, however,  I am still not very happy
> with the results. The existing Swedish models are trained on a small
> Swedish corpus (Talbanken), while I have access to a much larger training
> set (SUC corpus) .
>
> Here are the problems I face:
>
>  1.  current Swedish model fails when sentences start with lower-case
> letters. Also when there is no space between the full-stop and the next
> sentence, the model splits the sentence but keeps the first word of the
> second sentence as part of the first sentence. My current model cannot
> split at all if there is no space after the full stop.
>

Your production data should have the same characteristics as your training
data. The model will only handle cases such as sentences that start with a
lower-case letter, or a missing space between the full stop and the next
sentence, if the training data covers them. You can add sentences with
these characteristics to your training data.
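
As a rough illustration of adding such sentences, the sketch below takes
training data in the one-sentence-per-line format and appends a variant of
each sentence whose first letter is lower-cased. The class and method names
are hypothetical and not part of OpenNLP. It only covers the lower-case
case, not the missing-space case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of OpenNLP.
class SentenceTrainingAugmenter {

    // Returns the sentence with its first character lower-cased.
    static String lowerCaseFirst(String sentence) {
        if (sentence.isEmpty()) {
            return sentence;
        }
        return Character.toLowerCase(sentence.charAt(0)) + sentence.substring(1);
    }

    // Keeps every original line and appends a lower-cased variant
    // for each sentence where the variant differs from the original.
    static List<String> augment(List<String> sentences) {
        List<String> out = new ArrayList<>(sentences);
        for (String s : sentences) {
            String variant = lowerCaseFirst(s);
            if (!variant.equals(s)) {
                out.add(variant);
            }
        }
        return out;
    }
}
```

The augmented lines would then be written back to the training file before
retraining. How many such variants to add is a judgment call; the mix should
roughly match what the production data looks like.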


>
> Questions: Can one influence the training set by creating examples where
> there is no space between two sentences? Can one add features to the model?
> If so how? What about if  the training data contains sentences without end
> of sentence markers (fullstops, etc.)? Should such exist in the training
> set? Is there some example/documentation about it?
> At least I cannot seem to find it – but correct me if I am wrong.
>

Yes, simply append such examples to your training data. I don't think the
documentation covers how to create a training corpus; it only explains the
format. Any help improving it is highly appreciated.


> 2. The NER task suffers from similar problems. My current model has a hard
> time recognizing single names but does OK if the name consists of several
> words. However, I would like to include POS tag information as part of the
> training features. Is this possible? If so how? Any examples/documentation
> about it?
>

Does your corpus include examples of single names? Are they easy to
distinguish from other tokens? Maybe you should consider using Custom
Feature Generation
<http://incubator.apache.org/opennlp/documentation/1.5.2-incubating/manual/opennlp.html#tools.namefind.training.featuregen>
to add a "dictionary" element. You can build the dictionary with the
DictionaryBuilder tool, "bin/opennlp DictionaryBuilder".

I think you can add POS tag information, but I don't know exactly how to do
it. I would investigate whether it is possible using the "custom" element
of Custom Feature Generation
<http://incubator.apache.org/opennlp/documentation/1.5.2-incubating/manual/opennlp.html#tools.namefind.training.featuregen>,
where you can pass a class that implements "AdaptiveFeatureGenerator".
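
A minimal sketch of what such a class might look like follows. To keep it
compilable on its own, it declares a local stand-in interface that mirrors
the shape of opennlp.tools.util.featuregen.AdaptiveFeatureGenerator; in a
real project you would implement the OpenNLP interface instead. The
PosTagFeatureGenerator class, the "pos=" feature prefix, and the tagger
function are all assumptions made for illustration. A class loaded through
the "custom" element would presumably also need a no-argument constructor
that sets up its own tagger.

```java
import java.util.List;
import java.util.function.Function;

// Local stand-in mirroring the OpenNLP interface of the same name,
// declared here only so the sketch compiles without the OpenNLP jar.
interface AdaptiveFeatureGenerator {
    void createFeatures(List<String> features, String[] tokens, int index,
                        String[] previousOutcomes);
    void updateAdaptiveData(String[] tokens, String[] outcomes);
    void clearAdaptiveData();
}

// Hypothetical generator that adds one POS feature per token.
class PosTagFeatureGenerator implements AdaptiveFeatureGenerator {

    // Hypothetical tagger: maps a token array to one tag per token.
    private final Function<String[], String[]> tagger;
    private String[] cachedTokens;
    private String[] cachedTags;

    PosTagFeatureGenerator(Function<String[], String[]> tagger) {
        this.tagger = tagger;
    }

    @Override
    public void createFeatures(List<String> features, String[] tokens, int index,
                               String[] previousOutcomes) {
        // createFeatures is called once per token, so tag each sentence
        // only when a new token array is seen.
        if (tokens != cachedTokens) {
            cachedTags = tagger.apply(tokens);
            cachedTokens = tokens;
        }
        features.add("pos=" + cachedTags[index]);
    }

    @Override
    public void updateAdaptiveData(String[] tokens, String[] outcomes) {
        // No adaptive state in this sketch.
    }

    @Override
    public void clearAdaptiveData() {
        // No adaptive state in this sketch.
    }
}
```

Whatever tagger is used here must also be available, and behave the same
way, when the trained name finder is run on production data.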

Another possible way, maybe even simpler, is to use the additionalContext
argument of the NameFinder to pass the POS Tag info while training and
executing the name finder. It should work, but I have never tried it.
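
As a sketch of this second approach, the helper below turns an array of POS
tags into a per-token context array. It assumes that additionalContext is a
String[][] with one row per token, each row holding the extra feature
strings for that token; please check this against the NameFinder API of
your OpenNLP version. The PosContext class and the "pos=" prefix are
hypothetical.

```java
// Hypothetical helper for illustration; not part of OpenNLP.
class PosContext {

    // Builds one row per token, each holding a single "pos=TAG" feature.
    static String[][] fromTags(String[] posTags) {
        String[][] context = new String[posTags.length][];
        for (int i = 0; i < posTags.length; i++) {
            context[i] = new String[] {"pos=" + posTags[i]};
        }
        return context;
    }
}
```

The same array would have to be built and passed both when creating the
training samples and when running the name finder, so that the model sees
the same features in both settings.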

Regards,
William
