[ https://issues.apache.org/jira/browse/LUCENE-7940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150062#comment-16150062 ]
Robert Muir commented on LUCENE-7940:
-------------------------------------
I took a deeper look and I like it. I'm now able to run the experiments.
Overall there is a ~16% improvement in MAP on my short test queries
(Anandabazar Patrika corpus):
||Analyzer||MAP||bpref||index size||
|Standard|0.2551|0.2644|118984K|
|Bengali|0.2947|0.2976|97120K|
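For reference, the relative changes implied by the table can be recomputed with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and the inputs are simply the numbers from the table above. It gives roughly 15.5% for MAP and 12.6% for bpref, and an index about 18.4% smaller.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene: recomputes relative changes
// from the table values quoted in this comment.
class RelativeImprovementSketch {

    // Relative change of candidate versus baseline, in percent.
    static double relativeChangePercent(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Standard analyzer is the baseline, Bengali analyzer the candidate.
        System.out.println(String.format(Locale.ROOT, "MAP:        %+.1f%%",
                relativeChangePercent(0.2551, 0.2947)));
        System.out.println(String.format(Locale.ROOT, "bpref:      %+.1f%%",
                relativeChangePercent(0.2644, 0.2976)));
        System.out.println(String.format(Locale.ROOT, "index size: %+.1f%%",
                relativeChangePercent(118984, 97120)));
    }
}
```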
I fixed some minor javadoc nits, and I found another corner-case bug in the
normalizer. See a failing test for it here:
https://github.com/sunkuet02/lucene-solr/pull/1
It looks similar to the last one, just in the Ba Phalaa case. Try out
the new test; maybe this one is easy for you to fix:
{noformat}
normalizer failed on input: '্ব' (\u09cd\u09ac)
java.lang.ArrayIndexOutOfBoundsException: -1
    at __randomizedtesting.SeedInfo.seed([DEEE93D60E1BE9C5:ACA2B6D9BF7B5FB6]:0)
    at org.apache.lucene.analysis.bn.BengaliNormalizer.normalize(BengaliNormalizer.java:108)
    at org.apache.lucene.analysis.bn.TestBengaliNormalizer.testRandom(TestBengaliNormalizer.java:83)
...
{noformat}
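An index of -1 on a two-character input that starts with the virama suggests a lookback such as `s[i - 1]` being evaluated at `i == 0`. The sketch below only illustrates that general failure mode and one possible guard. It is not the actual code at BengaliNormalizer.java:108; the class name, the method names, and the "follows a consonant" check are assumptions invented for the example.

```java
// Hypothetical sketch, not the real BengaliNormalizer: shows how an
// unguarded lookback fails when the virama is the first character.
class LookbackGuardSketch {
    static final char VIRAMA = '\u09cd'; // BENGALI SIGN VIRAMA
    static final char BA = '\u09ac';     // BENGALI LETTER BA

    // Rough consonant check over the block U+0995..U+09B9 (an assumption
    // made for this example only).
    static boolean isConsonant(char c) {
        return c >= '\u0995' && c <= '\u09b9';
    }

    // Unguarded: reads s[i - 1] even when i == 0, which is index -1.
    static boolean followsConsonantUnguarded(char[] s, int i) {
        return isConsonant(s[i - 1]);
    }

    // Guarded: only looks back when a previous character exists.
    static boolean followsConsonantGuarded(char[] s, int i) {
        return i > 0 && isConsonant(s[i - 1]);
    }

    public static void main(String[] args) {
        char[] input = {VIRAMA, BA}; // the failing input from the test

        // The guarded check simply reports "no preceding consonant".
        System.out.println(followsConsonantGuarded(input, 0));

        try {
            followsConsonantUnguarded(input, 0);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unguarded lookback failed: " + e.getMessage());
        }
    }
}
```

Whether the right fix is an `i > 0` guard or a different treatment of a leading virama depends on the normalization rule itself; the new random test should confirm either way.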
> Bengali Analyzer for Lucene
> ---------------------------
>
> Key: LUCENE-7940
> URL: https://issues.apache.org/jira/browse/LUCENE-7940
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Md. Abdulla-Al-Sun
> Labels: features
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> Dear All,
> I noticed that an issue
> ([https://issues.apache.org/jira/browse/LUCENE-2725]) was created to add a
> Bengali analyzer to Lucene, but that was nearly seven years ago, and I have
> not seen any update on that issue in JIRA.
> A few days ago I needed to analyze my Bangla documents (I was using
> Elasticsearch). I contacted a member of elastic.co, who suggested that I
> contribute my research code to Lucene.
> I started reviewing the code in "modules/analysis" and noticed that a Hindi
> analyzer has already been added. Following the HindiAnalyzer and
> HindiStemmer code, I developed a BengaliAnalyzer for Lucene.
> I followed two research papers and implemented the features that are
> needed.
> Please tell me what I should do next.
> Thanks