Thanks for the answer, William!

I was pretty busy with other stuff, but I will soon try to find time for the
suggestions you gave me. I hope these will improve the performance.

Best,
Svetoslav

On 2012-01-30 19:12, "[email protected]" <[email protected]>
wrote:

>Hi, Svetoslav,
>
>
>On Fri, Jan 27, 2012 at 1:04 PM, Svetoslav Marinov <
>[email protected]> wrote:
>
>> Hi all,
>>
>> I asked this question earlier in another thread but did not get an
>> answer.
>>
>> I would like to train new models for Swedish for sentence detection,
>> tokenization, POS tagging and NER, since the existing models seem to
>> perform poorly on my data.
>>
>> I read the documentation and I certainly followed the steps for training
>> a model from the command line or the API; however, I am still not very
>> happy with the results. The existing Swedish models are trained on a small
>> Swedish corpus (Talbanken), while I have access to a much larger training
>> set (the SUC corpus).
>>
>> Here are the problems I face:
>>
>>  1. The current Swedish model fails when sentences start with lower-case
>> letters. Also, when there is no space between the full stop and the next
>> sentence, the model splits the sentence but keeps the first word of the
>> second sentence as part of the first sentence. My current model cannot
>> split at all if there is no space after the full stop.
>>
>
>Your production data should have the same characteristics as your training
>data. The model will only handle cases like sentences that start with
>lower-case letters, or no space between the full stop and the next sentence,
>properly if such cases were covered by the training data. You can add
>sentences with these characteristics to your training data.
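>
>For the sentence detector the training data is just one sentence per line,
>with an empty line marking a document boundary, so you could append
>something like this (made-up sentences, only meant to show the format):
>
>    This is an ordinary sentence.
>    this sentence starts with a lower-case letter and should still be detected.
>    Here is the last sentence of the document.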
>
>
>>
>> Questions: Can one influence the training set by creating examples where
>> there is no space between two sentences? Can one add features to the
>> model? If so, how? What if the training data contains sentences without
>> end-of-sentence markers (full stops, etc.)? Should such sentences exist in
>> the training set? Is there some example/documentation about this?
>> At least I cannot seem to find it, but correct me if I am wrong.
>>
>
>Yes, simply append it to your training data. I don't think the
>documentation covers how to create a training corpus; it only explains the
>format. Any help to improve it is highly appreciated.
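>
>For example, retraining the sentence detector with the 1.5.x API looks
>roughly like this (file names are placeholders; the "bin/opennlp
>SentenceDetectorTrainer" CLI tool does the same from the command line):
>
>    // classes from java.io, java.nio.charset, opennlp.tools.sentdetect
>    // and opennlp.tools.util
>    Charset charset = Charset.forName("UTF-8");
>    ObjectStream<String> lineStream =
>        new PlainTextByLineStream(new FileInputStream("sv-sent.train"), charset);
>    ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);
>
>    SentenceModel model;
>    try {
>      model = SentenceDetectorME.train("sv", sampleStream, true, null,
>          TrainingParameters.defaultParams());
>    } finally {
>      sampleStream.close();
>    }
>
>    OutputStream modelOut = new BufferedOutputStream(
>        new FileOutputStream("sv-sent.bin"));
>    try {
>      model.serialize(modelOut);
>    } finally {
>      modelOut.close();
>    }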
>
>
>> 2. The NER task suffers from similar problems. My current model has a
>> hard time recognizing single names but does OK if the name consists of
>> several words. However, I would like to include POS tag information as
>> part of the training features. Is this possible? If so, how? Any
>> examples/documentation about it?
>>
>
>Does your corpus include examples of single names? Is it easy to
>distinguish them from other tokens? Maybe you should consider using Custom
>Feature Generation
><http://incubator.apache.org/opennlp/documentation/1.5.2-incubating/manual/opennlp.html#tools.namefind.training.featuregen>
>to add a "dictionary" element. You can build the dictionary with the
>DictionaryBuilder tool ("bin/opennlp DictionaryBuilder").
>
>I think you can add POS tag information, but I don't know exactly how to
>do it. I would investigate whether it is possible using the "custom" element
>of the Custom Feature Generation
><http://incubator.apache.org/opennlp/documentation/1.5.2-incubating/manual/opennlp.html#tools.namefind.training.featuregen>,
>where you can pass a class that implements "AdaptiveFeatureGenerator".
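>
>A very rough sketch of such a class (untested; you would need to set the
>tags for each sentence before the name finder processes it):
>
>    import java.util.List;
>    import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;
>
>    public class PosTagFeatureGenerator implements AdaptiveFeatureGenerator {
>
>      private String[] tags;
>
>      // call once per sentence, with one tag per token
>      public void setTags(String[] tags) {
>        this.tags = tags;
>      }
>
>      public void createFeatures(List<String> features, String[] tokens,
>          int index, String[] previousOutcomes) {
>        if (tags != null && index < tags.length) {
>          features.add("pos=" + tags[index]);
>        }
>      }
>
>      public void updateAdaptiveData(String[] tokens, String[] outcomes) {
>        // no adaptive data to maintain
>      }
>
>      public void clearAdaptiveData() {
>        // nothing to clear
>      }
>    }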
>
>Another possible way, maybe even simpler, is to use the additionalContext
>argument of the NameFinder to pass the POS Tag info while training and
>executing the name finder. Should work, but I never tried it.
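>
>At run time that would look something like this, assuming "tokens" and
>"posTags" come from your tokenizer and POS tagger for the same sentence
>(again untested):
>
>    String[][] additionalContext = new String[tokens.length][];
>    for (int i = 0; i < tokens.length; i++) {
>      additionalContext[i] = new String[] { "pos=" + posTags[i] };
>    }
>    Span[] names = nameFinder.find(tokens, additionalContext);
>
>If I remember correctly, the NameSample constructor takes the same kind of
>String[][] additional context for the training side.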
>
>Regards,
>William
