[GitHub] lucenenet issue #179: Analysis work - Standard and Core namespaces (mostly)
GitHub user NightOwl888 commented on the issue: https://github.com/apache/lucenenet/pull/179 Okay, this is ready for review/merge now. I have already synced it with the master branch, so it includes some commits you have already reviewed. To filter them out of the review, you can merge the master branch into your analysis-work branch (you might want to make a backup first). About 95% of Analysis.Common is passing all tests. Of the 45 failing tests (out of 1405), the Synonym and Th namespaces account for most of the failures. But there is enough working functionality here that people will find it useful. Hunspell took quite a bit of time to get working right, but in the end I used the pure 4.8.0 implementation without any of the enhancements from more recent versions of Lucene. I discovered that the dictionary files are easy to find if you search for the file names on http://www.filewatcher.com/. There are also [several FTP sites](https://github.com/NightOwl888/lucenenet/blob/4d7b23c4269f0348a37fd470a3339befc64332ec/src/Lucene.Net.Tests.Analysis.Common/Analysis/Hunspell/TestAllDictionaries.cs#L34-L36) where you can grab the OpenOffice dictionaries. As was done in Java, the dictionary binaries are not part of the repository, so if you want to test Hunspell with real dictionaries you must enable the tests manually and download the dictionaries yourself. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] lucenenet issue #179: Analysis work - Standard and Core namespaces (mostly)
GitHub user NightOwl888 commented on the issue: https://github.com/apache/lucenenet/pull/179 I am working on wrapping this up now. Lucene.Net.Analysis.Common is now ported completely except for the Collation namespace and a few odd tests. I have merged the Lucene 6.1.0 Hunspell code with the 4.8.0 public API, and now all tests are passing except one. Sadly, with the pure 4.8.0 code they were all green. I have gone through the code line by line, but the piece that is actually broken still eludes me. It might have something to do with the OfflineSorter changes in 6.1.0, but it seems like there should be more failing tests if that were the case. Anyway, it is probably better to have one failing test than an old version that works as intended but cannot load any modern dictionary (unless you have a way to provide the old dictionaries along with this Lucene.Net version). When all is said and done, should I put everything into this pull request or open a new one targeting the master branch?
[GitHub] lucenenet issue #179: Analysis work - Standard and Core namespaces (mostly)
GitHub user NightOwl888 commented on the issue: https://github.com/apache/lucenenet/pull/179 Update: I am nearly finished with Analysis. The only sections remaining are `Collation` and `Analysis.Compound`. Nearly all tests have been ported and there are currently 42 failing out of 1384. However, I have run into a few snags. I have managed to make `Analysis.Hunspell` pass all of the tests for this Lucene version. However, when I started porting the `TestAllDictionaries` and `TestAllDictionaries2` tests (which use live data), it turned out that version 4.8.0 of Lucene doesn't work with the latest dictionaries because the dictionary format has changed. I think the simplest solution would be to upgrade just the Hunspell namespace to a more recent version of Lucene. I have BeyondCompare, so it is pretty simple to determine what the delta is and port just that part over. I wanted to run this by the team before going forward, and also get some opinions on whether the latest released version of Lucene is the appropriate point to upgrade to (this functionality doesn't appear to have changed much beyond what it takes to support newer dictionaries). Another problem I ran into is that the OpenOffice dictionaries aren't available at the [location specified](https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/test/org/apache/lucene/analysis/hunspell/TestAllDictionaries.java#L35) (http://archive.services.openoffice.org/pub/mirror/OpenOffice.org/contrib/dictionaries/). As you can see here (http://archive.services.openoffice.org/pub/mirror/), the OpenOffice.org directory no longer exists. Any ideas where I can obtain them? One other related matter is *where* to actually put these files. In Java the binaries are not in the repository. So, should I add a line to .gitignore and use the `\test-files\analysis\data\thunderbirdDicts\` directory as the place to look for them, or do you have another preference?
[GitHub] lucenenet issue #179: Analysis work - Standard and Core namespaces (mostly)
GitHub user synhershko commented on the issue: https://github.com/apache/lucenenet/pull/179 What is the difference or overlap between this PR and https://github.com/apache/lucenenet/pull/173? I can also see https://github.com/NightOwl888/lucenenet/tree/analysis-work-2. Can you please help me wrap my head around all those changesets?