Re: Unicode danda in sentence detector

William Colen Wed, 30 May 2012 18:36:03 -0700

As far as I know, you don't need a CLA for a patch. Simply open a Jira issue
and attach your patch to it.

Besides what James pointed out, you may also want to change the
end-of-sentence (EOS) characters. Two related new features are already
implemented in the trunk:

https://issues.apache.org/jira/browse/OPENNLP-428
This one added an optional command-line argument for setting the
end-of-sentence characters. The setting is persisted in the model. If you
are using the API, you can create a SentenceDetectorFactory and use it to
set the EOS chars.
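
To illustrate what a configurable EOS character set means for a language
like Bengali, here is a minimal sketch in plain Java. It does not use the
OpenNLP API and is not how the trained detector works: the class
EosCharsSketch and its methods are hypothetical names, and the sketch splits
after every EOS character, whereas the real sentence detector only treats
such characters as candidate boundaries and lets the model decide.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of OpenNLP.
public class EosCharsSketch {

    // ASCII defaults plus the danda (U+0964) and double danda (U+0965).
    static final char[] EOS_CHARS = {'.', '!', '?', '\u0964', '\u0965'};

    // Returns true if c is one of the configured EOS characters.
    static boolean isEos(char c, char[] eosChars) {
        for (char eos : eosChars) {
            if (c == eos) {
                return true;
            }
        }
        return false;
    }

    // Naive splitter: cuts the text after every EOS character.
    // A trained model would classify each candidate instead.
    static List<String> split(String text, char[] eosChars) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (isEos(text.charAt(i), eosChars)) {
                String sentence = text.substring(start, i + 1).trim();
                if (!sentence.isEmpty()) {
                    sentences.add(sentence);
                }
                start = i + 1;
            }
        }
        // Keep any trailing text that has no closing EOS character.
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Transliterated placeholder text with a danda as the first boundary.
        String text = "ami bhat khai\u0964 tumi ki khao?";
        for (String sentence : split(text, EOS_CHARS)) {
            System.out.println(sentence);
        }
    }
}
```

Without the danda in the EOS set, the first boundary in the sample text
would never be considered, which is why the character set matters before
any training happens.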

https://issues.apache.org/jira/browse/OPENNLP-434
This is a new feature that allows customizing the SentenceDetector. You can
extend the SentenceDetectorFactory and override methods as needed. You can
pass in the customized factory through either the command line or the API.

On Wed, May 30, 2012 at 7:19 PM, James Kosin <[email protected]> wrote:

> Hi Soubhik,
>
> Should already be supported.
> You have to pass the -encoding utf8 to the command line interface.
>
> James
>
> On 5/30/2012 1:52 PM, Soubhik (সৌভিক) wrote:
> > Hi,
> >
> > I'm trying to use OpenNLP to train a sentence detector for the Bengali
> > language ("bn"). I would like to add support for the Unicode danda
> > character in the opennlp.tools.sentdetect.lang.Factory class. This
> > character is a sentence break in Bengali, Hindi and several other Indian
> > languages. The code change should be small (< 10 lines).
> >
> > Is it correct to think that a change of this size will not require a CLA?
> >
> > Ref: en.wikipedia.org/wiki/Danda
> >
> > Regards,
> > Soubhik.
> > --
> >
>
>
