[GitHub] lucenenet issue #182: Analysis Missing Tests and Bug Fixes

NightOwl888 Sat, 27 Aug 2016 02:22:04 -0700

Github user NightOwl888 commented on the issue:

https://github.com/apache/lucenenet/pull/182

Ok, this is now down to 23 failing tests.

The 17 failing tests in Synonym are still really no closer to being solved.
I went over the SynonymMap and SynonymFilter classes line by line 3x. Wherever
the problem is, it is hidden well.

After spending a whole day stepping through code, I finally found a clue -
all of the failing tests are failing when the expected synonym input has a
space in it. For example, TestMatching doesn't fail until [this
line](https://github.com/NightOwl888/lucenenet/blob/analysis-bugz/src/Lucene.Net.Tests.Analysis.Common/Analysis/Synonym/TestSynonymMapFilter.cs#L875)
when the first expected input is "z x c v". It is unclear how that is supposed
to happen, though, since the tokenizer makes "z" a separate token, which causes
the logic to exit at that point without comparing "z x", "z x c", and "z x
c v". I went online hunting for a clue, but only found [this question on
SO](http://stackoverflow.com/questions/17283100/lucene-synonym-filter-behavior)
in which the poster is just as confused about it as I am.
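
To make the expectation concrete, here is a minimal sketch of multi-token matching over an already tokenized stream. It is not the SynonymFilter implementation: it assumes a plain dictionary keyed by token tuples instead of the real map, it replaces the matched tokens rather than stacking them, and the `apply_synonyms` name and the "zxcv" output are made up for illustration. It only shows that a matcher has to look ahead past "z" before deciding there is no match.

```python
from typing import Dict, List, Tuple

# Hypothetical rule table for illustration only: keys are input token
# sequences, values are the replacement token sequences.
Rules = Dict[Tuple[str, ...], List[str]]


def apply_synonyms(tokens: List[str], rules: Rules) -> List[str]:
    """Greedy longest-match replacement over an already tokenized stream."""
    max_len = max((len(key) for key in rules), default=0)
    output: List[str] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Look ahead: try the longest candidate window first, then shrink it.
        for width in range(min(max_len, len(tokens) - i), 0, -1):
            window = tuple(tokens[i:i + width])
            if window in rules:
                output.extend(rules[window])
                i += width
                matched = True
                break
        if not matched:
            # No rule starts at this token, so pass it through unchanged.
            output.append(tokens[i])
            i += 1
    return output


rules = {("z", "x", "c", "v"): ["zxcv"]}
print(apply_synonyms("a z x c v b".split(), rules))  # ['a', 'zxcv', 'b']
```

If the port gives up as soon as "z" arrives as a single token, the equivalent of the inner look-ahead loop is the part to compare against the Java source.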

I also tried again at the 5 failing tests in the Compound namespace. I went
over everything line by line. Then I tried stepping through the code. However,
I don't have a clue what the code is supposed to do, only what the expected
output is. In [this
test](https://github.com/NightOwl888/lucenenet/blob/analysis-bugz/src/Lucene.Net.Tests.Analysis.Common/Analysis/Compound/TestCompoundWordTokenFilter.cs#L84),
the first output succeeds. The second output is expected to be "ba". The first
token [comes back as
"b"](https://github.com/NightOwl888/lucenenet/blob/analysis-bugz/src/Lucene.Net.Analysis.Common/Analysis/Compound/hyphenation/HyphenationTree.cs#L414)
(is that right?), it then looks up
[TernaryTree.Find()](https://github.com/NightOwl888/lucenenet/blob/analysis-bugz/src/Lucene.Net.Analysis.Common/Analysis/Compound/hyphenation/HyphenationTree.cs#L415)
and it maps to "a" (is that right?), it then puts it as the second letter of
the word array (that seems right..?). The next letter is "a", it looks it up
and comes back as "z" (is that right?), it adds it as the 3rd element in the
array (now that can't be right, can it?), the next letters it looks up are "r"
and "j". The documentation is scarce. I really don't see
any hope of solving this without running side-by-side with the Java Lucene to
see where the paths diverge. That said, the most likely cause has something to
do with replacing the SAX parser with XmlReader, so that the HyphenationTree
isn't being populated right. But it is difficult to know what "right" is, since
there are no tests on the HyphenationTree itself.
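
One way the side-by-side comparison could be done is to log the same steps from both the Java and the .NET runs and find the first line where the logs differ. The sketch below assumes each run writes one trace line per step in the same format; the `first_divergence` helper and the trace lines are hypothetical placeholders, not output from either codebase.

```python
from itertools import zip_longest
from typing import Iterable, Optional, Tuple


def first_divergence(
    expected: Iterable[str], actual: Iterable[str]
) -> Optional[Tuple[int, Optional[str], Optional[str]]]:
    """Return (index, expected_line, actual_line) at the first mismatch, else None."""
    # zip_longest pads the shorter trace with None, so a missing line
    # also counts as a divergence.
    for index, (exp, act) in enumerate(zip_longest(expected, actual)):
        if exp != act:
            return index, exp, act
    return None


# Placeholder traces; real ones would come from logging added to both ports.
java_trace = ["step 1", "step 2", "step 3"]
dotnet_trace = ["step 1", "step 2 (different)", "step 3"]
print(first_divergence(java_trace, dotnet_trace))
# (1, 'step 2', 'step 2 (different)')
```

Dumping the populated HyphenationTree from both sides right after the pattern file is parsed, and comparing those dumps the same way, would show whether the XmlReader change is the culprit before any hyphenation logic runs.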



