[ https://issues.apache.org/jira/browse/LUCENE-7940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148355#comment-16148355 ]
Robert Muir commented on LUCENE-7940:
-------------------------------------
I ran some basic tests, but encountered some issues.
I couldn't index the test collection with the new analyzer as-is: it hit
AIOOBE (ArrayIndexOutOfBoundsException) from the normalizer in various places,
such as the "else" case of the Ja Phala normalization. I think it is easy to
see how this can go wrong on some strings.
{noformat}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.lucene.analysis.util.StemmerUtil.delete(StemmerUtil.java:96)
    at org.apache.lucene.analysis.bn.BengaliNormalizer.normalize(BengaliNormalizer.java:90)
    at org.apache.lucene.analysis.bn.BengaliNormalizationFilter.incrementToken(BengaliNormalizationFilter.java:53)
    at org.apache.lucene.analysis.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:51)
    at org.apache.lucene.analysis.bn.BengaliStemFilter.incrementToken(BengaliStemFilter.java:41)
    ...
{noformat}
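The trace above ends in System.arraycopy, reached through an in-place delete. A minimal, self-contained sketch of how that kind of failure can arise follows. It is not the BengaliNormalizer or StemmerUtil code: the class DeleteBoundsSketch and the helpers uncheckedDelete and guardedDelete are hypothetical, and which index actually goes out of range in the patch is an assumption. The point is only that a rule which deletes at i - 1 or i + 1 without checking the token boundaries hands System.arraycopy an invalid position.

```java
final class DeleteBoundsSketch {

    // Hypothetical in-place delete with no bounds checks (illustration only,
    // not a copy of StemmerUtil.delete).
    static int uncheckedDelete(char[] s, int pos, int len) {
        // Shift the tail left by one; arraycopy throws if pos is negative
        // or the computed length is negative.
        System.arraycopy(s, pos + 1, s, pos, len - pos - 1);
        return len - 1;
    }

    // Guarded variant: an out-of-range position is treated as "nothing to delete".
    static int guardedDelete(char[] s, int pos, int len) {
        if (pos < 0 || pos >= len) {
            return len;
        }
        if (pos < len - 1) {
            System.arraycopy(s, pos + 1, s, pos, len - pos - 1);
        }
        return len - 1;
    }

    // Returns true if the unchecked helper throws for this input.
    static boolean uncheckedThrows(char[] s, int pos, int len) {
        try {
            uncheckedDelete(s.clone(), pos, len);
            return false;
        } catch (IndexOutOfBoundsException e) {
            // ArrayIndexOutOfBoundsException is a subclass of this type.
            return true;
        }
    }

    public static void main(String[] args) {
        // A token that starts with virama (U+09CD) followed by YA (U+09AF):
        // a rule that deletes "the character before the virama" computes pos = -1.
        char[] term = {'\u09CD', '\u09AF'};
        System.out.println("unchecked throws: " + uncheckedThrows(term, -1, term.length));
        System.out.println("guarded length:   " + guardedDelete(term.clone(), -1, term.length));
    }
}
```

With a guard like this, a rule that fires at the first or last character becomes a no-op instead of an exception. Whether a no-op is the right linguistic behavior for such tokens is a separate question.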
The existing tests will also find bugs in the normalization if you run them
enough:
{noformat}
rmuir@beast:~/workspace/lucene-solr/lucene/analysis/common$ ant beast -Dbeast.iters=100 -Dtestcase=TestBengaliAnalyzer -Dtestmethod=testRandomStrings
...
[beaster] Started J0 PID(15042@localhost).
[beaster] 2> TEST FAIL: useCharFilter=true text='?><!-
\ua880\ued49\uda48\udc50\u60a4\u0001\u3f30\u0497\u0385\u8961 \u09c5\u09af'
[beaster] 2> NOTE: reproduce with: ant test -Dtestcase=TestBengaliAnalyzer -Dtests.method=testRandomStrings -Dtests.seed=7DCC89234C956F75 -Dtests.locale=fi -Dtests.timezone=UTC -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
[beaster] [22:16:36.471] ERROR 0.19s | TestBengaliAnalyzer.testRandomStrings <<<
[beaster] > Throwable #1: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
[beaster] > at __randomizedtesting.SeedInfo.seed([7DCC89234C956F75:F545899DEF913840]:0)
[beaster] > at java.lang.String.<init>(String.java:195)
[beaster] > at org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.toString(CharTermAttributeImpl.java:259)
[beaster] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:733)
[beaster] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
[beaster] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:540)
[beaster] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:448)
[beaster] > at org.apache.lucene.analysis.bn.TestBengaliAnalyzer.testRandomStrings(TestBengaliAnalyzer.java:51)
[beaster] > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[beaster] > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[beaster] > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[beaster] > at java.lang.reflect.Method.invoke(Method.java:497)
[beaster] > at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1713)
[beaster] > at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:907)
[beaster] > at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:943)
[beaster] > at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:957)
[beaster] > at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
[beaster] > at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
[beaster] > at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
[beaster] > at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
[beaster] > at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
[beaster] > at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
[beaster] > at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
[beaster] > at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
[beaster] > at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
[beaster] > at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:916)
[beaster] > at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:802)
[beaster] > at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:852)
[beaster] > at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:863)
[beaster] > at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
[beaster] > at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
[beaster] > at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
[beaster] > at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
[beaster] > at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
[beaster] > at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
[beaster] > at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
[beaster] > at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
[beaster] > at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
[beaster] > at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
[beaster] > at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
[beaster] > at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
[beaster] > at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
[beaster] > at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
[beaster] > at java.lang.Thread.run(Thread.java:745)
[beaster] 2> NOTE: test params are: codec=Asserting(Lucene70): {dummy=PostingsFormat(name=Memory)}, docValues:{}, maxPointsInLeafNode=562, maxMBSortInHeap=5.206999081402232, sim=RandomSimilarity(queryNorm=true): {}, locale=fi, timezone=UTC
[beaster] 2> NOTE: Linux 4.4.0-92-generic amd64/Oracle Corporation 1.8.0_45 (64-bit)/cpus=8,threads=1,free=154136664,total=189267968
[beaster] 2> NOTE: All tests run in this JVM: [TestBengaliAnalyzer]
[beaster]
[beaster] Tests with failures [seed: 7DCC89234C956F75]:
[beaster] - org.apache.lucene.analysis.bn.TestBengaliAnalyzer.testRandomStrings
[beaster]
{noformat}
I think the normalizer could use some more thorough tests?
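One possible shape for a more thorough check, sketched without any Lucene dependency: enumerate every string up to a small length over a handful of characters the rules care about, and verify each result. This is exhaustive enumeration, a different technique from the randomized testRandomStrings run above, and it only complements that test. Everything here is hypothetical: the CharNormalizer interface, the toy rule in buggyNormalize (delete the character before a virama), and the assumed contract that the returned length stays within [0, len].

```java
final class ExhaustiveNormalizerCheck {

    /** Hypothetical contract: normalize s[0..len) in place and return the new length. */
    interface CharNormalizer {
        int normalize(char[] s, int len);
    }

    // Small alphabet: virama (U+09CD), YA (U+09AF), KA (U+0995).
    static final char[] ALPHABET = {'\u09CD', '\u09AF', '\u0995'};

    /** Returns the first input that throws or yields an invalid length, or null if none does. */
    static String firstFailure(CharNormalizer normalizer, int maxLen) {
        for (int len = 0; len <= maxLen; len++) {
            int total = 1;
            for (int k = 0; k < len; k++) {
                total *= ALPHABET.length;
            }
            for (int idx = 0; idx < total; idx++) {
                // Decode idx as a base-|ALPHABET| number to build the candidate token.
                char[] s = new char[len];
                int rest = idx;
                for (int k = 0; k < len; k++) {
                    s[k] = ALPHABET[rest % ALPHABET.length];
                    rest /= ALPHABET.length;
                }
                String input = new String(s);
                try {
                    int out = normalizer.normalize(s, len);
                    if (out < 0 || out > len) {
                        return input;
                    }
                } catch (RuntimeException e) {
                    return input;
                }
            }
        }
        return null;
    }

    // Toy rule with a boundary bug: deletes the character before a virama
    // without checking that one exists.
    static int buggyNormalize(char[] s, int len) {
        for (int i = 0; i < len; i++) {
            if (s[i] == '\u09CD') {
                System.arraycopy(s, i, s, i - 1, len - i);
                len--;
                i--;
            }
        }
        return len;
    }

    // Same toy rule, but a virama at position 0 is left alone.
    static int guardedNormalize(char[] s, int len) {
        for (int i = 0; i < len; i++) {
            if (s[i] == '\u09CD' && i > 0) {
                System.arraycopy(s, i, s, i - 1, len - i);
                len--;
                i--;
            }
        }
        return len;
    }

    public static void main(String[] args) {
        String bad = firstFailure(ExhaustiveNormalizerCheck::buggyNormalize, 4);
        System.out.println(bad == null ? "no failure" : "fails on a token of length " + bad.length());
    }
}
```

Short tokens over a tiny alphabet keep the search space small (121 strings for lengths 0 through 4 here) while still covering tokens where a rule fires at the first or last character.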
> Bengali Analyzer for Lucene
> ---------------------------
>
> Key: LUCENE-7940
> URL: https://issues.apache.org/jira/browse/LUCENE-7940
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Md. Abdulla-Al-Sun
> Labels: features
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> Dear All,
> I noticed that an issue
> ([https://issues.apache.org/jira/browse/LUCENE-2725]) was created to add a
> Bengali Analyzer to Lucene, but that was nearly seven years ago, and I have
> not seen any update on that issue in JIRA.
> A few days ago, I needed to analyze my Bangla documents (I am using
> Elasticsearch). I contacted a member of elastic.co, who suggested that I
> contribute my research code to Lucene.
> I started by reviewing the code in "modules/analysis" and noticed that a
> Hindi analyzer has already been added. Following the HindiAnalyzer and
> HindiStemmer code, I have developed a BengaliAnalyzer for Lucene.
> I followed two research papers and implemented the features that are
> needed.
> Please let me know what I should do next.
> Thanks