Lucene.Net.Analysis.Common Work

Shad Storhaug Wed, 24 Aug 2016 02:57:25 -0700

Itamar,

Just updating you regarding the work on Analysis on the mailing list as per 
your request.


To the rest, if anyone wants to help close out Analysis.Common, please 
coordinate with Itamar.

Scope of Remaining 43 Failing Tests

17 of the failing tests are in the Analysis.Synonym namespace (out of 51). I am 
not sure exactly what is happening here, but I am convinced that this issue is 
local to the namespace (or the tests) and it can be overcome with a little 
diligence. I plan to have another look at this today. Anyone who beats me to a 
solution gets an "atta-boy" (or "atta-girl") from me.

14 of the failing tests are in the Analysis.Th namespace (out of 16), and is 
subsequently causing the Analysis.Core.TestFactories.Test() method to fail (15 
failing tests total). That is, Thai support is completely broken. Using the 
Clone() method of the BreakIterator subsequently results in null reference 
exceptions when SetText() is called. It is unclear why cloning is even 
necessary, though, and it appears this issue can be overcome by creating the 
iterators using the static methods instead of cloning. However the Thai words 
are still not breaking correctly, and I am not sure what is happening there. 
Looks like the documentation needs to be consulted on ICU to figure out how to 
use it properly. I haven't tried all of the available locales (only US 
English), but I do find it odd that there is no Locale for Thai in the package 
though since that appears to be the only language it supports (well, perhaps 
Lao, Khmer and Burmese will need this, but we don't have support for them 
anyway). I came up with another thought today that I didn't try as well - 
perhaps the culture of the current thread needs to be changed to Thai in order 
to make it work. Worth another look.

5 of the failing tests are in the Analysis.Compound namespace (out of 17). I 
had to take a slightly different approach than the Java implementation because 
it uses a SAX XML parser of which there is no direct equivalent in .NET, so 
made do with the XmlReader. It appears to be loading the data correctly, but it 
is difficult to verify this without cross-checking it against its Java 
counterpart. So, it is unknown whether the bugs are in the XML parsing, the 
data structure, the code that uses the data structure, some dependency of the 
code, or all of the above.

2 of the failing tests are in the HtmlStripCharFilterTest class. For some 
reason, there are null characters being added to the random tests which are not 
parsing correctly. The data being printed out in the log is not helpful to 
determine what the input is that is causing the failures. I attempted to pass 
null characters in using a couple of different forms and the parser seems to 
handle them fine. My gut feeling tells me that this is a problem with the 
random strings that are being generated for the test, not a problem with the 
parser.

1 of the failing tests is in the Analysis.Tr namespace (Turkish). The cause of 
this failure appears to be in the core. The Character.CodePointAt() method is 
returning character 304 rather than the expected LATIN_CAPITAL_LETTER_I + 
non-spacing mark character. I didn't look into it further than that.

1 of the failing tests, Analysis.Core.TestBugInSomething.TestWrapping() appears 
to be an issue with porting the test, and I will address that today.

1 of the failing tests, 
Analysis.Pattern.TestCaptureGroupTokenFilter.TestRandomString() is (I suspect) 
another issue with the random string generation.

1 of the failing tests, Analysis.Util.TestCharArrayMap.TestToString() is 
failing because we are not passing an object reference of CharArrayMap into the 
CharArraySet when it is instantiated in the CharArrayMap.KeySet() method. 
Without the direct object reference, calling ToString() on the 
CharArrayMap.Keys property returns an incorrect result because the CharArraySet 
and CharArrayMap are not referencing the same array. As far as I can tell, this 
is not causing an issue other than this one test.

It's a tricky issue because CharArrayMap is generic and CharArraySet is not. 
There are at least 3 ways we could solve this:


1.       Wrap the array (and other shared dependencies) into an internal 
reference type so both classes are referring to the same instance.

2.       Create a non-generic ICharArrayMap interface that references the few 
generic members as type object, and use the interface rather than the concrete 
type as the type to pass to CharArraySet.

3.       Make CharArraySet generic so its type remains in sync with the 
CharArrayMap.

I took a stab at #3 on this branch: 
https://github.com/NightOwl888/lucenenet/tree/analysis-chararray-generic, and 
it appears like it can compile that way, but it caused some other problems and 
I didn't get a chance to see if they could all be resolved.

It also occurred to me that rather than making an abstract CharArraySet and 
generic CharArraySet<V>, it might be worth considering making an ICharArraySet 
interface to pass the concrete type around without knowing its type, and making 
the CharArraySet a superclass of CharArraySet<string>, since more than 95% of 
the time we are using it with a string anyway (how many words do you know that 
are not strings?). It might be confusing, though - most people would expect an 
untyped CharArraySet class to be type object rather than string.

Not Yet Ported

There are just a few (known) remaining pieces missing out of Analysis.Common.


1.       Collation - Looks like we need to port over the Java Collator class. I 
took a stab at it in [this WIP 
commit](https://github.com/NightOwl888/lucenenet/commit/e6a037b2bb971b1015b2fc96eca663267df08273).
 AFAIK, ChristopherHaws is porting it 
(https://github.com/apache/lucenenet/pull/181#issuecomment-241481735), check 
with him if you want to give it a try as there doesn't seem to be much to it 
but will require some finesse to make it work with .NET.

2.       Analysis.Core.UpperCaseFilterFactory - Not sure how I missed it, but I 
will get on this today.

3.       Analysis.Miscellaneous.PatternAnalyzer - Obsolete and superseded by 
the Analysis.Pattern namespace. Do we really need this?

4.       Analysis.Core.TestAllAnalyzersHaveFactories

5.       Analysis.Core.TestRandomChains

6.       Analysis.Miscellaneous.PatternAnalyzerTest

7.       Analysis.Miscellaneous.TestSingleTokenFilter

8.       Analysis.Util.TestAnalysisSPILoader


Testing Issues

As for putting all of the Hunspell language dictionary (in compatible versions) 
into the repository to enable automated testing, I am not sure how well that 
would be received. The binaries are together an additional 108 MB, which means 
a lot of extra stuff to download if cloning Lucene.Net. In addition, the tests 
take quite a while to finish - at least 20-30 minutes. I think the Java dev 
team made the right choice by making this a manual task.

In any case, I have created a repo here: 
https://github.com/NightOwl888/lucenenet-hunspell that you can 
clone/copy/rehost to share these known working dictionary files with users of 
Lucene.Net. That is not to say that other versions of dictionaries won't work, 
but there can be issues when .NET doesn't recognize the encoding string or 
there are tags that the Hunspell parser doesn't recognize.

It also occurred to me that there are probably other hidden issues that won't 
rear their ugly head unless we change the culture of the current thread while 
testing. We should modify the test framework to test another culture on demand 
or perhaps randomly to weed out any localization issues while parsing strings.


Merging to Master/Prerelease

The only real concern here might be that the solution ultimately used to fix 
(or not fix) CharArraySet might end up changing the public API of CharArraySet. 
Other than that, I don't see any real issues that are holding up the release 
here. Thoughts?


Shad Storhaug/NightOwl888

Lucene.Net.Analysis.Common Work

Reply via email to