[jira] [Commented] (LUCENE-7940) Bengali Analyzer for Lucene

Robert Muir (JIRA) Thu, 31 Aug 2017 22:29:29 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-7940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150062#comment-16150062 ]


Robert Muir commented on LUCENE-7940:
-------------------------------------

I took a deeper look, and I like it. I'm able to run the experiments now:

Overall there is a ~16% improvement in MAP on my short test queries 
(Anandabazar Patrika corpus):
||Analyzer||MAP||bpref||index size||
|Standard|0.2551|0.2644|118984K|
|Bengali|0.2947|0.2976|97120K|
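
For reference, the relative changes implied by the table can be recomputed with a few lines of Java. This is only a sketch: the class and method names are illustrative and not part of the patch, and the numbers are copied from the table above.

```java
import java.util.Locale;

public class RelativeGain {
    /** Percent change of candidate relative to baseline (positive means larger). */
    static double percentChange(double baseline, double candidate) {
        return 100.0 * (candidate - baseline) / baseline;
    }

    public static void main(String[] args) {
        // Values from the table: Standard analyzer first, Bengali analyzer second.
        System.out.println(String.format(Locale.ROOT, "MAP:        %+.1f%%", percentChange(0.2551, 0.2947)));
        System.out.println(String.format(Locale.ROOT, "bpref:      %+.1f%%", percentChange(0.2644, 0.2976)));
        System.out.println(String.format(Locale.ROOT, "index size: %+.1f%%", percentChange(118984, 97120)));
    }
}
```

This works out to roughly +15.5% for MAP, +12.6% for bpref, and -18.4% for index size.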

I fixed some minor javadoc nits, and I found another corner-case bug in the 
normalizer. See a failing test for it here: 
https://github.com/sunkuet02/lucene-solr/pull/1

It looks similar to the last one, just in the Ba Phalaa case. Try out 
the new test; maybe this one is easy for you to fix:

{noformat}
normalizer failed on input: '্ব' (\u09cd\u09ac)

java.lang.ArrayIndexOutOfBoundsException: -1
        at __randomizedtesting.SeedInfo.seed([DEEE93D60E1BE9C5:ACA2B6D9BF7B5FB6]:0)
        at org.apache.lucene.analysis.bn.BengaliNormalizer.normalize(BengaliNormalizer.java:108)
        at org.apache.lucene.analysis.bn.TestBengaliNormalizer.testRandom(TestBengaliNormalizer.java:83)
...
{noformat}
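
An index of -1 on a two-character input is what you would expect if a rule looks two positions back from the BA (s[i - 2]) while i is only 1. The sketch below illustrates that failure mode and a bounds guard. It is not the code in BengaliNormalizer: the class name, the rule itself, and the delete helper are all hypothetical and exist only to show the pattern.

```java
public class PrecedingCharGuard {
    static final char HASANT = '\u09CD'; // BENGALI SIGN VIRAMA
    static final char BA = '\u09AC';     // BENGALI LETTER BA

    /** Removes s[pos] by shifting the tail left; returns the new length. */
    static int delete(char[] s, int pos, int len) {
        if (pos < len - 1) {
            System.arraycopy(s, pos + 1, s, pos, len - pos - 1);
        }
        return len - 1;
    }

    /**
     * Hypothetical rule: drop a HASANT+BA pair when the character before the
     * pair is not itself a HASANT. Returns the new logical length.
     */
    static int normalize(char[] s, int len) {
        for (int i = 0; i < len; i++) {
            if (s[i] == BA && i >= 1 && s[i - 1] == HASANT) {
                // Without the i >= 2 check, input U+09CD U+09AC reaches this
                // point with i == 1 and s[i - 2] reads index -1.
                if (i >= 2 && s[i - 2] != HASANT) {
                    len = delete(s, i, len);     // remove BA
                    len = delete(s, i - 1, len); // remove HASANT
                    i -= 2;                      // resume at the shifted-in character
                }
            }
        }
        return len;
    }

    public static void main(String[] args) {
        char[] failing = {HASANT, BA}; // the input from the failing test
        System.out.println(normalize(failing, failing.length));
    }
}
```

With the guard in place, the pair at the very start of the input is left untouched instead of throwing; whether that is the right normalization for a leading Ba Phalaa is a separate decision for the actual fix.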



> Bengali Analyzer for Lucene
> ---------------------------
>
>                 Key: LUCENE-7940
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7940
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Md. Abdulla-Al-Sun
>              Labels: features
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Dear All, 
> I noticed that an issue 
> ([https://issues.apache.org/jira/browse/LUCENE-2725]) was created to add a 
> Bengali analyzer to Lucene, but that was nearly seven years ago, and I have 
> not seen any update on that issue in JIRA. 
> A few days ago, I needed to analyze my Bangla documents (I was using 
> Elasticsearch). I contacted a member of elastic.co, who suggested that I 
> contribute my research code to Lucene.
> I started by reviewing the code in "modules/analysis" and noticed that a 
> Hindi analyzer had already been added. Following the HindiAnalyzer and 
> HindiStemmer code, I developed a BengaliAnalyzer for Lucene. 
> I followed two research papers and implemented the features that were 
> needed. 
> Please let me know what I should do next. 
> Thanks 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

