[GitHub] lucenenet issue #179: Analysis work - Standard and Core namespaces (mostly)

2016-08-22 Thread NightOwl888
Github user NightOwl888 commented on the issue:

https://github.com/apache/lucenenet/pull/179
  
Okay, this is ready for review/merge now. I have synced it up with the 
master branch already, so there are some commits you have already reviewed 
here. But you can merge the master branch into your analysis-work branch (you 
might want to make a backup) to filter them out of the review.

About 95% of Analysis.Common is passing all tests. Of the 45 failing tests 
(out of 1405), the Synonym and Th namespaces account for most of the failures. 
But there is enough working functionality here that people will find it useful.

Hunspell took quite a bit of time to get working right, but in the end I 
used the pure 4.8.0 implementation without any of the enhancements of 
more recent versions of Lucene. I discovered that the dictionary files are easy 
to find if you search for the file names on http://www.filewatcher.com/. There 
are also [several FTP 
sites](https://github.com/NightOwl888/lucenenet/blob/4d7b23c4269f0348a37fd470a3339befc64332ec/src/Lucene.Net.Tests.Analysis.Common/Analysis/Hunspell/TestAllDictionaries.cs#L34-L36)
 where you can grab the OpenOffice dictionaries. As was done in Java, the 
dictionary binaries are not part of the repository, and if you want to test 
Hunspell with real dictionaries you must enable the tests manually and download 
the dictionaries yourself. 




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] lucenenet issue #179: Analysis work - Standard and Core namespaces (mostly)

2016-08-22 Thread NightOwl888
Github user NightOwl888 commented on the issue:

https://github.com/apache/lucenenet/pull/179
  
I am working on wrapping this up now. Lucene.Net.Analysis.Common is now 
ported completely except for the Collation namespace and a few odd tests.

I have merged the Lucene 6.1.0 Hunspell code with the 4.8.0 
public API, and now all tests are passing except one. Sadly, with the pure 
4.8.0 code they were all green. I have gone through line by line but the actual 
piece that is broken still eludes me for now. It might have something to do 
with the OfflineSorter changes in 6.1.0, but it seems like there should be more 
failing tests if that were the case.

Anyway, when push comes to shove it is probably better to have one failing 
test than to have an old version that works the way it was intended but simply 
doesn't load any modern dictionary (unless you have a way to provide the old 
dictionaries along with this Lucene.Net version).

When all is said and done, should I put everything into this pull request 
or open a new one targeting the master branch? 




[GitHub] lucenenet issue #179: Analysis work - Standard and Core namespaces (mostly)

2016-08-20 Thread NightOwl888
Github user NightOwl888 commented on the issue:

https://github.com/apache/lucenenet/pull/179
  
Update


I am nearly finished with Analysis. The only sections remaining are 
`Collation` and `Analysis.Compound`. Nearly all tests have been ported and 
there are currently 42 failing out of 1384. However, I have run into a few 
snags. 

I have managed to make `Analysis.Hunspell` pass all of the tests for this 
Lucene version. However, when I started porting the `RunAllDictionaries` and 
`RunAllDictionaries2` tests (that use live data), it turns out that version 
4.8.0 of Lucene doesn't work with the latest dictionaries because the 
dictionary format has changed.

I think the simplest solution would be to upgrade just the Hunspell 
namespace to a more recent version of Lucene. I have BeyondCompare, so it is 
pretty simple to determine what the delta is and just port that part over. I 
wanted to run this by the team before going forward, and also get some opinions 
on whether the latest released version of Lucene is the appropriate point to 
upgrade to (this functionality doesn't appear to have changed much beyond what 
it takes to support newer dictionaries).

Another problem I ran into is that the OpenOffice dictionaries aren't 
available at the [location 
specified](https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/test/org/apache/lucene/analysis/hunspell/TestAllDictionaries.java#L35) 
(http://archive.services.openoffice.org/pub/mirror/OpenOffice.org/contrib/dictionaries/). 
As you can see here (http://archive.services.openoffice.org/pub/mirror/), the 
OpenOffice.org directory no longer exists. Any ideas where I can obtain them?

One other related matter is *where* to actually put these files. In Java 
the binaries are not in the repository. So, should I add a line to .gitignore 
and use the `\test-files\analysis\data\thunderbirdDicts\` directory as the 
point to look for them, or do you have another preference?
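
To illustrate the first option, an ignore rule along these lines would keep downloaded dictionaries out of version control. This is only a sketch: the `**/` prefix is an assumption, since the location of `test-files` relative to the repository root is not stated here.

```
# Assumed .gitignore entry: ignore locally downloaded Hunspell dictionaries
**/test-files/analysis/data/thunderbirdDicts/
```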





[GitHub] lucenenet issue #179: Analysis work - Standard and Core namespaces (mostly)

2016-08-15 Thread synhershko
Github user synhershko commented on the issue:

https://github.com/apache/lucenenet/pull/179
  
What is the difference / overlap between this PR and 
https://github.com/apache/lucenenet/pull/173 ? Also, I can see 
https://github.com/NightOwl888/lucenenet/tree/analysis-work-2 - can you please 
help me wrap my head around all those changesets?

