Re: Remaining Work/Priorities

Itamar Syn-Hershko Wed, 12 Oct 2016 10:07:32 -0700

CI failure seems to be worked on:
https://twitter.com/codebetterCI/status/785854074713468932 (Thanks Wyatt
for pointing that out)


I will look into the rest in a little while

--

Itamar Syn-Hershko
http://code972.com | @synhershko <https://twitter.com/synhershko>
Freelance Developer & Consultant
Lucene.NET committer and PMC member

On Tue, Oct 11, 2016 at 10:10 PM, Shad Storhaug <[email protected]>
wrote:

> Update
> ======
>
> I have just pushed some commits that fix several bugs in the
> Lucene.Net.Codecs project (all 452 tests pass most of the time, a few
> random failures) and fix all but 4 of the failing tests in Lucene.Net.Core.
>
>
> Fix for Test Context
> -------------------------
>
> For now, I have added method override stubs to each subclass in order to
> add the [Test] attribute, so NUnit will run them in the correct context. I
> did that on all of the superclass tests except for the ones in QueryParser
> (since Itamar mentioned he would be working in that area). Itamar, you will
> probably need to follow suit to get all of the QP tests to pass - namely
> with the QueryParserTestBase and TestQueryParser classes.
>
> I have carefully put all of these changes into a single commit so it can
> be reverted easily, if this solution doesn't happen to be compatible with
> xUnit: https://github.com/apache/lucenenet/commit/
> 2a79edea6359e1ee1f83269cc7dc3ef2753ebf2c. Hopefully that makes life
> easier for @conniey.
>
> @Itamar, let me know when this is completed on your end so I can do a
> double revert and squash the test stubs from QueryParser into an
> all-inclusive revert-able commit.
>
> We can now correctly see how many tests we have in the core. Currently
> there are 2730 - it seems we are still missing 720 tests, assuming they all
> were for something port-able.
>
>
> Remaining Tests
> ---------------------
>
> Next I plan to work on locating any tests that we have missed (starting in
> the core). It seems these fall into several categories:
>
> 1. Tests that have not yet been ported.
> 2. Tests that have been partially ported that have not been added to the
> project.
> 3. Tests that have been ported, but are missing the [Test] attribute.
> 4. Tests in classes that have been ported that have been commented out
> (presumably because at the time they were ported the dependencies did not
> yet exist).
> 5. Tests that have been Ignored in .NET that were not in Java.
> 6. Tests that have NUnit Assume.That() logic that depends on some
> non-existant JRE condition, so they are not running in .NET.
>
> I'll make a quick effort to get them to pass, but the main goal will be to
> ensure they all can run and are included in the project. Just a heads up
> that the number of test failures is likely to increase on this pass (but
> the number of bugs will likely decrease).
>
>
> Failing Core Tests
> -----------------------
>
> I have looked into the remaining tests somewhat. There are 2 issues that I
> need some input on to solve.
>
>
> TestRamUsageEstimator.TestSanity()
>
> Java Lucene uses a JRE-specific API to determine how much header size to
> add on each field. This makes the estimates higher in Java. But more
> importantly, this test is failing because the estimate for a real string
> instance is coming back as the same size as its shallow size (16 bytes in
> this case) - it needs to be at least 1 byte more than that for the test to
> pass. In Java (at least in a 64 bit environment), there are an extra 4
> bytes being added for each field.
>
> Technically, there is a way to get these numbers from .NET, but it
> involves calling undocumented APIs using pointers and will likely be
> different from one .NET version to the next (a bad idea for a project that
> needs to support multiple .NET versions). The only solution I can think of
> is to hard code in an extra 4 bytes for 64 bit (and most likely 2 bytes for
> 32 bit) in order to make the numbers for the instances larger than their
> shallow size. I suppose the alternative would be to either comment out the
> string test or change it to >= make it pass. Thoughts? Alternatives?
>
>
> TestNumericDocValuesUpdates.TestUpdateOldSegments()
>
> I discovered what the issue is here (normally that is the hard part), but
> it seems that the proper solution is going to be a major task. The
> NamedSPILoader (backed by SPIClassIterator) in Java Lucene is used as a
> service locator to load classes throughout the project. In the Codec
> abstract class, it is used to load up the codec for the context it is used
> in. However, our port of the NamedSPILoader simply loads all of the classes
> from the current AppDomain without any way to order them or override them.
>
> The problem is that in Lucene, this was meant to be an extension point.
> And this particular test (and probably many more of them) uses that
> extension point to change the codec to a Mock from the test framework. This
> line from TestRuleSetupAndRestoreClassEnv pretty much sums up what the
> issue is:
>
> > Debug.Assert(Codec is Lucene42RWCodec, "fix your classpath to have
> tests-framework.jar before lucene-core.jar");
>
> Basically, it is using a configuration file to order the classes that are
> loaded so the test mocks take priority over the built-in codecs.
>
> Just fixing the test could be done by making the static NamedSPILoader
> variable in the Codec class internal and swapping in a test double.
> However, that doesn't solve the bigger issue that Lucene.Net is missing its
> extensibility for anyone who wants to write their own codec (or tap into
> one of the other extensibility points). I guess the bigger question is how
> important will it be for anyone to extend Lucene codecs or inject
> dependencies into Analyzer factories? There doesn’t appear to be any more
> extensibility than that in Lucene 4.8.0, but that could change in more
> recent or future versions of Lucene.
>
>
> CI Builds
> -----------
>
> Not working. Can someone look into that please?
>
>
> Thanks,
> Shad Storhaug (NightOwl888)
>
>
>
> -----Original Message-----
> From: Shad Storhaug
> Sent: Wednesday, October 5, 2016 8:23 PM
> To: [email protected]
> Cc: Connie Yau; '[email protected]'
> Subject: RE: Remaining Work/Priorities
>
> > Analysis.ICU (Depends on ICU4j) hopefully we can remove the ICU DLLs
> from the analysis.commons module?
>
> Just for clarification, these are two entirely different things in Java.
> Analysis.Common (Analysis.Collator and Analysis.Th) depends on parts of
> Java:
>
> import java.text.BreakIterator;
> import java.text.Collator;
> import java.text.ParseException;
> import java.text.RuleBasedCollator;
>
> Highlighter.PostingsHighlighter and Highlighter.VectorHighlight also
> depend on parts of Java:
>
> import java.text.BreakIterator;
> import java.text.CharacterIterator;
>
> Analysis.ICU depends on a separate (icu4j) package:
>
> import com.ibm.icu.text.Normalizer;
> import com.ibm.icu.text.Normalizer2;
> import com.ibm.icu.text.Transliterator;
> import com.ibm.icu.text.Replaceable;
> import com.ibm.icu.text.Transliterator;
> import com.ibm.icu.text.UTF16;
> import com.ibm.icu.text.UnicodeSet;
> import com.ibm.icu.text.FilteredNormalizer2;
> import com.ibm.icu.text.Collator;
> import com.ibm.icu.text.RuleBasedCollator;
> import com.ibm.icu.util.ULocale;
> import com.ibm.icu.text.RawCollationKey;
>
> That said, icu4j DOES have Collator and RuleBasedCollator classes, but it
> DOES NOT have a BreakIterator or CharacterIterator class. It is unclear
> whether the Collator from icu4j would work as a replacement for the one in
> core Java.
>
> When I was digging through the JDK code, I noticed that BreakIterator and
> RuleBasedCollator have a lot of common ICU dependencies there, so even if
> the RuleBasedCollator from icu4j is compatible, it might make sense for us
> to port the one from Java anyway so we are dealing with the same shared
> dependencies in Analysis.Common.
>
> Once we port over the classes from the Java JDK, we will be able to
> eliminate our current ICU4NET dependency (and the platform issues that come
> with it). That said, porting over those pieces could take considerable
> work. In the interim it might make sense to make separate projects/NuGet
> packages to isolate the areas that depend on BreakIterator,
> CharacterIterator, and RuleBasedCollator so the rest can be released for
> wide/cross-platform use. Perhaps we can even make a basic (scaled down)
> BreakIterator for Highlighter that breaks on spaces between words and
> punctuation between sentences, which wouldn't work for Thai, but would work
> for most other languages.
>
> Porting the (icu4j) package is another complete ball of yarn, we should
> take a look at (https://github.com/sillsdev/icu-dotnet) to see if there
> is enough overlap there to power Analysis.ICU (offhand it looks as though
> some classes are missing, though). It is a wrapper around the C library -
> it may be that we just need to port more of it to get all of the pieces we
> need.
>
> Speaking of Collation, @ChristopherHaws have you made any more progress on
> Analysis.Collation? Were you able to determine if icu-dotnet's collator
> will make the tests pass?
>
> > I'm on it QueryParser.Flexible
>
> Great. The TimeZone probably just needs more research to work out how to
> utilize (in order to implement the failing test). Also, FYI MSDN's
> recommendation (https://msdn.microsoft.com/en-us/library/system.timezone(
> v=vs.110).aspx) is to use TimeZoneInfo rather than TimeZone (I noticed
> that several of the tests were recently modified to use TimeZone rather
> than TimeZoneInfo).
>
> As for the culture, in .NET I am pretty sure that we need to pass it as a
> parameter to another overload of `QueryParser.Parse` rather than making it
> a property of QueryParser. But we can deal with that in one step after you
> have finished porting.
>
> --
>
> Shad Storhaug (NightOwl888)
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Itamar Syn-Hershko
> Sent: Wednesday, October 5, 2016 5:28 AM
> To: [email protected]
> Cc: Connie Yau
> Subject: Re: Remaining Work/Priorities
>
> Awesome, thanks for all the hard work Shad!
>
> Our first priority should be fixing all remaining tests - in particular
> the one in Core. We should be ready to release and stamp our builds as 100%
> stable. As you mentioned, this could be an infrastructure issue - hopefully
> *Connie *can give a status update on her effort on the switch to xUnit?
>
> With regards to Modules, here's an updated breakdown based on your email +
> forgotten pieces + my comments:
>
> *Ported:*
> Lucene.Net (Core) - 15 failing / 1989 total Lucene.Net.Analysis.Common - 0
> failing / 1445 total Lucene.Net.Classification - 0 failing / 9 total
> Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet - (including
> #188 will be) 0 failing / 152 total Lucene.Net.Join - 0 failing / 27 total
> Lucene.Net.Memory - 0 failing / 10 total Lucene.Net.Misc - 2 failing / 42
> total Lucene.Net.Queries - 2 failing / 96 total Lucene.Net.QueryParser - 1
> failing / 203 total Lucene.Net.Suggest - 0 failing / 142 total
>
> We should do a second pass on the pieces we marked as ported, just to make
> sure the port is full and we didn't leave anything behind :)
>
> *Need to be ported:*
> Highlighter (Depends on Collator (which is still being ported) and
> BreakIterator (which we don't have a solution that works in .NET core yet))
> Spatial (has 3rd party libraries that need to be updates) Spatial4n (
> https://github.com/synhershko/Spatial4N) needs to be brought up to speed
> with spatial4j, dependencies of which may cause some issues....
> Codecs
> Partially ported, mostly the tests weren't ported Grouping Not urgent, but
> provides nice functionality that users will probably like
>
> The only part with dependencies seems to be the spatial module - I will
> have a look there soon if you don't get to that before I do.
>
> *Can wait* - some modules are less frequently used, we should stabilize
> and release first and then work on them based on demand Analysis.ICU
> (Depends on ICU4j) hopefully we can remove the ICU DLLs from the
> analysis.commons module? I keep getting reports on some issues they are
> causing Analysis.Kuromoji Analysis.Morfologik (Depends on Morfologik)
> Analysis.Phonetic (Depends on Apache Commons) Apache commons is mostly
> helper libraries, so there's probably not real dependency just lots of
> replacement Analysis.SmartCN Analysis.Stempel (currently in progress)
> Analysis.UIMA (Depends on Tagger, uimaj-core, WhiteSpaceTokenizer) Demo
> while important because can help newbies, we can do better by providing
> docs and real world examples. I'm on it QueryParser.Flexible
>
> *No need to port* - neither are needed in our context Benchmark (many
> dependencies) Replicator (many dependencies) Sandbox (Depends on Apache
> Jakarta)
>
> Once all modules are ported and all tests are passing, I think we should
> get two more items fixed before an official release:
>
> 1. .NET Core support - I'm not clear on the status of it at the moment. We
> probably want to have it in for the release.
>
> 2. Public API Inconsistencies. We can discuss what should be done and what
> not when we get to that stage. Some are an obvious "fixme" but some will
> break code compatibility with Java I think we should avoid.
>
> One last note - *Wyatt*, do we know why there are no CI builds lately?
>
> --
>
> Itamar Syn-Hershko
> http://code972.com | @synhershko <https://twitter.com/synhershko>
> Freelance Developer & Consultant Lucene.NET committer and PMC member
>
> On Sun, Oct 2, 2016 at 10:01 PM, Shad Storhaug <[email protected]>
> wrote:
>
> > Hello,
> >
> > I just wanted to open this discussion to talk about the work remaining
> > to be done on Lucene.Net version 4.8.0. We are nearly there, but that
> > doesn't mean we don't still need help!
> >
> >
> > FAILING TESTS
> > -------------------
> >
> > We now have over 5000 passing tests and as soon as pull request #188 (
> > https://github.com/apache/lucenenet/pull/188) is merged, by my count
> > we have only 20 (actual) failing tests. Here is the breakdown by project:
> >
> > Lucene.Net (Core) - 15 failing / 1989 total Lucene.Net.Analysis.Common
> > - 0 failing / 1445 total Lucene.Net.Classification - 0 failing / 9
> > total Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet -
> > (including #188 will be) 0 failing / 152 total Lucene.Net.Join - 0
> > failing / 27 total Lucene.Net.Memory - 0 failing / 10 total
> > Lucene.Net.Misc - 2 failing / 42 total Lucene.Net.Queries - 2 failing
> > / 96 total Lucene.Net.QueryParser - 1 failing / 203 total
> > Lucene.Net.Suggest - 0 failing / 142 total
> >
> > The reason why I said ACTUAL tests above is because I recently
> > discovered that many of the "failures" that are being reported are
> > false negatives (in fact, the VS2015 NUnit test runner shows there are
> > 135 failing tests total and 902 tests total that don't belong to any
> > project). Most NUnit 2.6 test runners do not correctly run tests in
> > shared abstract classes with the correct context (test setup) to make
> > them pass. These out-of-context runs add several additional minutes to
> the test run.
> >
> > As an experiment, I upgraded to NUnit 3.4.1 and it helped the
> > situation somewhat - that is, it ran the tests in the correct context
> > and I was able to determine that we have more tests than the numbers
> > above and they are all succeeding. However, it also ran the tests in
> > an invalid context (that is, the context of the abstract class without
> > any setup) and some of them still showed as failures.
> >
> > I know @conniey is currently working on porting the tests over to xUnit.
> > Hopefully, swapping test frameworks alone (or using some of the new
> > fancy test attributes) is enough to fix this issue. If not, we need to
> > find another solution - preferably one that can be applied to all of
> > the tests in abstract classes without too much effort or changing them
> > so they are too different from their Java counterpart.
> >
> > Remaining Pieces to Port
> > ---------------------------------
> >
> > I took an inventory of the remaining pieces left to port a few days
> > ago and here is what that looks like (alphabetical order):
> >
> > 1. Analysis.ICU (Depends on ICU4j)
> > 2. Analysis.Kuromoji
> > 3. Analysis.Morfologik (Depends on Morfologik) 4. Analysis.Phonetic
> > (Depends on Apache Commons) 5. Analysis.SmartCN 6. Analysis.Stempel
> > (currently in progress) 7. Analysis.UIMA (Depends on Tagger,
> > uimaj-core, WhiteSpaceTokenizer) 8. Benchmark (many dependencies) 9.
> > Demo 10. Highlighter (Depends on Collator (which is still being
> > ported) and BreakIterator (which we don't have a solution that works
> > in .NET core yet)) 11. Replicator (many dependencies) 12. Sandbox
> > (Depends on Apache Jakarta) 13. Spatial (Already ported in #174
> > (https://github.com/apache/ lucenenet/pull/174), needs a recent
> > version of spatial4n) 14. QueryParser.Flexible
> >
> > Itamar, it would be helpful if you would be so kind as to organize
> > this list in terms of priority. It also couldn't hurt to update the
> > contributing documents
> > (https://github.com/apache/lucenenet/blob/master/CONTRIBUTING.md,
> > and
> > https://cwiki.apache.org/confluence/display/LUCENENET/Current+Status
> > with the latest information so anyone who wants to help out knows the
> > current status.
> >
> > Of course, it is the known status of dependencies that we need
> > clarification on. Which of these dependencies is known to be ported?
> > Which of them are ported but are not up to date? Which of them are
> > known not to be ported, and which of them are unknown?
> >
> >
> > Public API Inconsistencies
> > ---------------------------------
> >
> > One thing that I have had my eye on for a while now is the
> > .NETification/consistency of the core API (that is, in the Lucene.Net
> > project). There are several issues that I would like to address
> including:
> >
> >
> > 1.       Method names that are still camelCase
> >
> > 2.       Properties that should be methods (because they do a lot of
> > processing or because they are non-deterministic)
> >
> > 3.       Methods that should be properties
> >
> > 4.       .Size() vs .Size vs .Count - should generally all be .Count in
> > .NET
> >
> > 5.       Interfaces should begin with "I"
> >
> > 6.       Classes should not begin with "I" followed by another capital
> > letter (for some reason some of them were named that way)
> >
> > 7.       .CharAt() should probably be this[]
> >
> > 8.       Generic types nested within generic types (which cause Visual
> > Studio to crash when Intellisense tries to read them)
> >
> > ... and so on. The only thing is these are all sweeping changes that
> > will affect everyone helping out on Lucene.Net and anyone who is
> > currently using the beta. So, I just wanted to gather some input on
> > when the most appropriate time to begin working on these sweeping
> changes would be?
> >
> >
> > Thanks,
> > Shad Storhaug (NightOwl888)
> >
> >
> >
> >
> >
> >
> >
>

Re: Remaining Work/Priorities

Reply via email to