On Thu, May 31, 2012 at 1:36 PM, Jörn Kottmann <[email protected]> wrote:
>
> The Wikipedia reference says it's commonly used for
> Indian languages, so maybe we should just include it,
> e.g. like we did for Portuguese.
>
> On the other hand, we might also need custom feature
> generation to get good results.
> How are words delimited in Indian languages? With spaces?

Words are delimited by spaces in Bengali, Hindi and most other Indian languages.

>
> I suggest first testing by passing in the danda char,
> measuring how it performs, and then deciding whether we also
> need to adapt the feature generation for Indian languages.

I started with a very small document set (about 1500 sentences from news and
blogs downloaded from the internet), with no abbreviations and no custom
features. I used -eosChars '।?!' and got the following result:

Precision: 0.8967468175388967
Recall: 0.8386243386243386
F-Measure: 0.8667122351332877

As you mentioned, the danda is a sentence break in multiple Indian languages,
so does it make sense to add it in the Factory?

>
> Do you have training data you can train it on? If there is a publicly
> available data set, we would appreciate having format support for it
> directly in OpenNLP.
>

I'll refine the model using a larger dataset and possibly an abbreviations
dictionary. I believe it should be possible to do this with openly available
data.

Cheers!
Soubhik.

> What do you think?
>
> Jörn
>
>
> On 05/31/2012 03:35 AM, William Colen wrote:
>>
>> As far as I know you don't need a CLA for a patch. Simply open a Jira
>> issue and attach your patch to it.
>>
>> Besides what James pointed out, you may also want to change the EOS
>> characters. There are two related new features that are already
>> implemented in the trunk:
>>
>> https://issues.apache.org/jira/browse/OPENNLP-428
>> This one added an optional command line argument where you set the
>> end-of-sentence characters. This setting will be persisted to the model.
>> If you are using the API you can create a SentenceDetectorFactory and
>> use it to set the EOS chars.
>>
>> https://issues.apache.org/jira/browse/OPENNLP-434
>> This is a new feature that allows customizing the SentenceDetector. You
>> can extend the SentenceDetectorFactory and override methods as needed.
>> You can pass in the customized factory using either the command line or
>> the API.
>>
>>
>> On Wed, May 30, 2012 at 7:19 PM, James Kosin <[email protected]>
>> wrote:
>>
>>> Hi Soubhik,
>>>
>>> This should already be supported.
>>> You have to pass -encoding utf8 to the command line interface.
>>>
>>> James
>>>
>>> On 5/30/2012 1:52 PM, Soubhik (সৌভিক) wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm trying to use OpenNLP to train a sentence detector for the Bengali
>>>> language ("bn"). I would like to add support for the Unicode danda
>>>> character in the opennlp.tools.sentdetect.lang.Factory class. This
>>>> character is a sentence break in Bengali, Hindi and several other
>>>> Indian languages. The code change should be small (< 10 lines).
>>>>
>>>> Is it correct to think that a change of this size will not require a
>>>> CLA?
>>>>
>>>> Ref: en.wikipedia.org/wiki/Danda
>>>>
>>>> Regards,
>>>> Soubhik.
>>>> --
>>>>
>>>
>
--
Soubhik Bhattacharya
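
A note on the evaluation figures reported in this thread: the F-Measure is consistent with being the harmonic mean of the reported precision and recall. The snippet below is only a minimal sketch for checking that arithmetic. The class and method names are hypothetical and it does not call any OpenNLP API.

```java
// Hypothetical helper, not part of OpenNLP: recomputes the balanced
// F-measure (harmonic mean) from a precision/recall pair.
class FMeasureCheck {

    // Returns 2PR / (P + R), or 0 when both inputs are 0.
    static double fMeasure(double precision, double recall) {
        if (precision + recall == 0.0) {
            return 0.0;
        }
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Values copied from the evaluation output quoted in the thread.
        double precision = 0.8967468175388967;
        double recall = 0.8386243386243386;
        // Should agree with the reported F-Measure up to rounding.
        System.out.println(fMeasure(precision, recall));
    }
}
```

With the reported precision and recall this gives roughly 0.8667, in line with the F-Measure quoted above.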
