Re: Unicode danda in sentence detector

সৌভিক Sun, 03 Jun 2012 02:34:13 -0700

On Thu, May 31, 2012 at 1:36 PM, Jörn Kottmann <[email protected]> wrote:
>
> The Wikipedia reference says it's commonly used for
> Indian languages, so maybe we should just include them,
> e.g. like we did for Portuguese.
>
> On the other hand, we might also need custom feature
> generation to get good results.
> How are words delimited in Indian languages? With spaces?


Words are delimited by spaces in Bengali, Hindi and most other Indian
languages.

>
> I suggest first testing by passing in the danda char,
> measuring how it performs, and then deciding whether we might also
> need an adaptation of the feature generation for Indian languages.

I started with a very small docset (about 1500 sentences from
news/blogs downloaded from the internet), with no abbreviation
dictionary and no custom features. I passed -eosChars '।?!' and got
the following result:

Precision: 0.8967468175388967
Recall: 0.8386243386243386
F-Measure: 0.8667122351332877
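
As a sanity check on the figures above, the F-measure should be the
harmonic mean of precision and recall. A minimal Java sketch, assuming
the evaluator reports the usual balanced F1 (the class and method names
are made up for illustration and are not part of OpenNLP):

```java
// Illustrative check only; FMeasureCheck is a hypothetical helper, not an OpenNLP class.
class FMeasureCheck {

    // Balanced F-measure (F1): harmonic mean of precision and recall.
    static double fMeasure(double precision, double recall) {
        if (precision + recall == 0.0) {
            return 0.0; // avoid division by zero when both values are zero
        }
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        double precision = 0.8967468175388967;
        double recall = 0.8386243386243386;
        // Should agree with the reported F-Measure up to floating point rounding.
        System.out.println(fMeasure(precision, recall));
    }
}
```

The three numbers are consistent with each other under that assumption.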

As you've mentioned, the danda is a sentence break in multiple Indian
languages. So does it make sense to add it to the Factory?
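
To make the question concrete, here is a rough sketch of a per-language
EOS table with the danda (U+0964) added for "bn" and "hi". It is plain
Java and does not use or reproduce the actual
opennlp.tools.sentdetect.lang.Factory code; the class name, the method
names and the language list are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; this is not the OpenNLP Factory class.
class EosCharsSketch {

    // DEVANAGARI DANDA (U+0964), also used as the sentence break in Bengali text.
    static final char DANDA = '\u0964';

    // Returns candidate end-of-sentence characters for a language code.
    // The choice of "bn" and "hi" here is an assumption for the example.
    static char[] eosCharsFor(String languageCode) {
        if ("bn".equals(languageCode) || "hi".equals(languageCode)) {
            return new char[] {DANDA, '?', '!'};
        }
        return new char[] {'.', '!', '?'};
    }

    // Collects the indexes in text where any of the EOS characters occurs.
    // These are only candidates; a trained model would still have to decide
    // which of them are real sentence breaks.
    static List<Integer> candidatePositions(String text, char[] eosChars) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            for (char eos : eosChars) {
                if (text.charAt(i) == eos) {
                    positions.add(i);
                    break;
                }
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // Transliterated placeholder text with a danda after the first sentence.
        String text = "ami bhat khai\u0964 tumi ki khao?";
        System.out.println(candidatePositions(text, eosCharsFor("bn")));
    }
}
```

With the default set ('.', '!', '?') the danda in the sample text would
not be considered a candidate at all, which is why it has to be passed
in via -eosChars today.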

>
> Do you have training data you can train it on? If there is a publicly
> available data set, we would appreciate having format support for it
> directly in OpenNLP.
>

I'll refine the model using a larger dataset and possibly an
abbreviations dictionary. I believe it should be possible to do this
with openly available data.

Cheers!
Soubhik.

> What do you think?
>
> Jörn
>
>
> On 05/31/2012 03:35 AM, William Colen wrote:
>>
>> As far as I know you don't need a CLA for a patch. Simply open a Jira and
>> attach your patch to it.
>>
>> Besides what James pointed out, you may also want to change the EOS characters.
>> There are two related new features that are already implemented in the
>> trunk:
>>
>> https://issues.apache.org/jira/browse/OPENNLP-428
>> This one added an optional command line argument where you set the
>> end-of-sentence characters. This setting will be persisted to the model.
>> If you are using the API you can create a SentenceDetectorFactory and
>> use it to set the EOS chars.
>>
>> https://issues.apache.org/jira/browse/OPENNLP-434
>> This is a new feature that allows customizing the SentenceDetector. You
>> can extend the SentenceDetectorFactory and override methods as needed.
>> You can pass in the customized factory using either the command line or
>> the API.
>>
>>
>> On Wed, May 30, 2012 at 7:19 PM, James Kosin<[email protected]>
>>  wrote:
>>
>>> Hi Soubhik,
>>>
>>> This should already be supported.
>>> You have to pass -encoding utf8 to the command line interface.
>>>
>>> James
>>>
>>> On 5/30/2012 1:52 PM, Soubhik (সৌভিক) wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm trying to use OpenNLP to train a sentence detector for the Bengali
>>>> language ("bn"). I would like to add support for the Unicode danda
>>>> character in the opennlp.tools.sentdetect.lang.Factory class. This
>>>> character is a sentence break in Bengali, Hindi and several other
>>>> Indian languages. The code change should be small (< 10 lines).
>>>>
>>>> Is it correct to think that a change of this size will not require a
>>>> CLA?
>>>>
>>>> Ref: en.wikipedia.org/wiki/Danda
>>>>
>>>> Regards,
>>>> Soubhik.
>>>> --
>>>>
>>>
>



--
Soubhik Bhattacharya
