Re: Unicode danda in sentence detector

Jörn Kottmann Thu, 31 May 2012 01:07:00 -0700

The Wikipedia reference says it's commonly used for
Indian languages, so maybe we should just include it,
e.g. like we did for Portuguese.


On the other hand, we might also need custom feature
generation to get good results.
How are words delimited in Indian languages? With spaces?

I suggest we first test by passing in the danda char,
measure how it performs, and then decide whether we also
need to adapt the feature generation for Indian languages.

Do you have training data you can train it on? If there is a publicly
available data set, we would appreciate having format support for it
directly in OpenNLP.

What do you think?

Jörn

On 05/31/2012 03:35 AM, William Colen wrote:

As far as I know you don't need a CLA for a patch. Simply open a Jira and
attach your patch to it.

Besides what James pointed out, you may also want to change the EOS characters.
There are two related new features that are already implemented in the
trunk:

https://issues.apache.org/jira/browse/OPENNLP-428
This one added an optional command line argument where you set the
end-of-sentence characters. This setting will be persisted to the model. If
you are using the API you can create a SentenceDetectorFactory and use it
to set the EOS chars.

https://issues.apache.org/jira/browse/OPENNLP-434
This is a new feature that allows customizing the SentenceDetector. You can
extend the SentenceDetectorFactory and override methods as needed. You can
pass in the customized factory using either the command line or the API.
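
To make the role of the EOS characters more concrete, here is a minimal
sketch in plain Java. It deliberately does not use any OpenNLP classes; the
class name, the method name and the Bengali EOS set are all assumptions made
for illustration. It only shows the first step, scanning the text for
candidate end-of-sentence positions. Deciding whether each candidate is a
real sentence break is left to the trained model.

```java
import java.util.ArrayList;
import java.util.List;

public class EosCandidateSketch {

    // Hypothetical EOS set for Bengali: danda (U+0964) plus '?' and '!'.
    // This is an illustrative choice, not a set taken from OpenNLP.
    static final char[] BENGALI_EOS = {'\u0964', '?', '!'};

    // Returns the offset of every character in text that is one of eosChars.
    static List<Integer> candidatePositions(CharSequence text, char[] eosChars) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            for (char eos : eosChars) {
                if (c == eos) {
                    positions.add(i);
                    break;
                }
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // Placeholder text; the first sentence ends with a danda.
        String text = "first sentence\u0964 second sentence?";
        System.out.println(candidatePositions(text, BENGALI_EOS));
    }
}
```

If the danda is missing from the EOS set, it never shows up as a candidate,
so no amount of training data would let the detector split on it.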


On Wed, May 30, 2012 at 7:19 PM, James Kosin wrote:

Hi Soubhik,

It should already be supported.
You have to pass -encoding utf8 to the command line interface.

James
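
As a rough illustration of why the encoding matters here, the following
plain Java sketch shows that the danda (U+0964) takes three bytes in UTF-8
and does not survive a round trip through a single-byte charset such as
US-ASCII. The class and method names are made up for this example and are
not part of OpenNLP.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DandaEncodingSketch {

    // DEVANAGARI DANDA, the sentence terminator discussed in this thread.
    static final String DANDA = "\u0964";

    // Formats a byte array as upper-case hex, e.g. {0x0A, 0x1B} -> "0A1B".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // True if encoding s with cs and decoding it again yields the same string.
    static boolean survivesRoundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs).equals(s);
    }

    public static void main(String[] args) {
        System.out.println(hex(DANDA.getBytes(StandardCharsets.UTF_8))); // prints E0A5A4
        System.out.println(survivesRoundTrip(DANDA, StandardCharsets.UTF_8)); // prints true
        System.out.println(survivesRoundTrip(DANDA, StandardCharsets.US_ASCII)); // prints false
    }
}
```

If the training data is read with the wrong charset, the danda is replaced
or mangled before the sentence detector ever sees it.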

On 5/30/2012 1:52 PM, Soubhik (সৌভিক) wrote:

Hi,

I'm trying to use OpenNLP to train a sentence detector for the Bengali
language ("bn"). I would like to add support for the Unicode danda
character in the opennlp.tools.sentdetect.lang.Factory
class. This character is a sentence break in Bengali, Hindi and several
other Indian languages. The code change should be small (< 10 lines).

Is it correct to think that a change of this size will not require a CLA?

Ref: en.wikipedia.org/wiki/Danda

Regards,
Soubhik.