[ 
https://issues.apache.org/jira/browse/LUCENE-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004205#comment-13004205
 ] 

Robert Muir commented on LUCENE-2950:
-------------------------------------

bq. How would this work? E.g. many contribs depend on the common-analyzers 
module. Removing this dependency would almost certainly make the contribs 
non-functional.

The dependency is mostly bogus. Here are the contribs in question:
* ant
* demo
* lucli
* misc
* spellchecker
* swing
* wordnet

For example the ant IndexTask only depends on this so it can make this hashmap:
{noformat}
    static {
      analyzerLookup.put("simple", SimpleAnalyzer.class.getName());
      analyzerLookup.put("standard", StandardAnalyzer.class.getName());
      analyzerLookup.put("stop", StopAnalyzer.class.getName());
      analyzerLookup.put("whitespace", WhitespaceAnalyzer.class.getName());
    }
{noformat}

I think we could remove this: the task already has reflection code to build the 
analyzer, so if you supply "Xyz", why not just look for XyzAnalyzer as a 
fallback?
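That fallback could be sketched roughly like this (the helper name and the package prefix are illustrative, not IndexTask's actual code; the real task would still pass the resolved name to Class.forName):

```java
import java.util.HashMap;
import java.util.Map;

public class AnalyzerLookup {
    private static final Map<String, String> analyzerLookup =
        new HashMap<String, String>();
    static {
        // a couple of hardcoded entries could stay for back-compat
        analyzerLookup.put("simple", "org.apache.lucene.analysis.SimpleAnalyzer");
        analyzerLookup.put("whitespace", "org.apache.lucene.analysis.WhitespaceAnalyzer");
    }

    /** Resolve a short name to an analyzer class name, falling back to
     *  "XyzAnalyzer" in a conventional package when it is not in the map. */
    static String resolve(String name) {
        String known = analyzerLookup.get(name);
        if (known != null) {
            return known;
        }
        // fallback: "xyz" -> ".../XyzAnalyzer" (package choice is illustrative)
        String capitalized = Character.toUpperCase(name.charAt(0)) + name.substring(1);
        return "org.apache.lucene.analysis." + capitalized + "Analyzer";
    }
}
```

With that in place, only genuinely irregular names would need map entries at all.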

The lucli code has StandardAnalyzer as a default; I think it's best not to have 
a default analyzer at all. I would have fixed this already, but this contrib 
module has no tests! That makes it hard to want to get in there and clean up.

The misc code mostly supplies an Analyzer inside embedded tools that don't 
actually analyze anything. We could add a pkg-private NullAnalyzer that throws 
UnsupportedOperationException from its tokenStream(). Since these tools 
shouldn't be analyzing anything, that seems reasonable to do.
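A minimal sketch of that idea (standalone here for illustration; the real class would be pkg-private, extend org.apache.lucene.analysis.Analyzer, and return a TokenStream rather than Object):

```java
import java.io.Reader;

/** Hypothetical placeholder analyzer for tools that never analyze text:
 *  any attempt to produce a token stream fails fast. */
final class NullAnalyzer {
    public Object tokenStream(String fieldName, Reader reader) {
        throw new UnsupportedOperationException(
            "this tool does not analyze text; supply a real Analyzer instead");
    }
}
```

Failing fast here would also surface any place that is accidentally analyzing text today.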

The spellchecker code has a hardcoded WhitespaceAnalyzer... why is this? It 
seems the whole spellchecking n-gramming is wrong anyway. The spellchecker uses 
a special form of n-gramming whose gram sizes depend on the word length. 
Currently it does this in Java code and indexes with WhitespaceAnalyzer 
(creating a lot of garbage in the process, e.g. lots of Field objects), but 
this could all be cleaned up so that the spellchecker uses its own 
SpellCheckNgramAnalyzer, for better performance to boot.
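As a rough sketch of what such an analyzer would have to emit, here is the length-dependent gram-size logic in plain Java (the exact length thresholds are illustrative, not a quote of the spellchecker's code):

```java
import java.util.ArrayList;
import java.util.List;

public class SpellGrams {
    // gram sizes grow with word length (thresholds illustrative)
    static int minGram(int len) { return len > 5 ? 3 : (len == 5 ? 2 : 1); }
    static int maxGram(int len) { return len > 5 ? 4 : (len == 5 ? 3 : 2); }

    /** All n-grams of the word, for each size in [minGram, maxGram]. */
    static List<String> grams(String word) {
        List<String> out = new ArrayList<String>();
        int len = word.length();
        for (int n = minGram(len); n <= maxGram(len); n++) {
            for (int i = 0; i + n <= len; i++) {
                out.add(word.substring(i, i + n));
            }
        }
        return out;
    }
}
```

A dedicated analyzer producing these grams as tokens directly would avoid building a throwaway Field per gram.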

The swing code defaults to a WhitespaceAnalyzer... in my opinion, again, it's 
best not to have a default analyzer and to make the user specify one.

The wordnet code uses StandardAnalyzer for indexing the wordnet database. It 
also includes a very limited SynonymTokenFilter. In my opinion, now that we 
merged the SynonymFilter from Solr that supports multi-word synonyms etc. 
(which this wordnet module DOES NOT!), we should nuke this whole thing.

Instead, we should make the synonym-loading process more flexible, so that one 
can produce the SynonymMap from various formats (such as the existing Solr 
format, a relational database, wordnet's format, or the OpenOffice thesaurus 
format, among others). We could have parsers for these various formats. This 
would give us a much more powerful synonym capability that works nicely 
regardless of format. We could then look at other improvements, such as 
allowing SynonymFilter to use a more RAM-conscious data structure for its 
synonym mappings (e.g. an FST), and everyone would see the benefits.
So hopefully this entire contrib could be deprecated.
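The format parsers could all feed one common contract, something like this hypothetical sketch (a plain Map stands in for SynonymMap, and the Solr-style parser is simplified to "a => b,c" lines only, with none of the real multi-word handling):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical parser contract: each format (Solr rules, wordnet prolog,
 *  thesaurus, ...) would get its own implementation producing the same map. */
interface SynonymParser {
    Map<String, List<String>> parse(BufferedReader in) throws IOException;
}

/** Simplified Solr-style rules: only "a => b,c" lines are handled here. */
class SimpleSolrParser implements SynonymParser {
    public Map<String, List<String>> parse(BufferedReader in) throws IOException {
        Map<String, List<String>> map = new HashMap<String, List<String>>();
        String line;
        while ((line = in.readLine()) != null) {
            String[] sides = line.split("=>");
            if (sides.length != 2) continue; // ignore anything else in this sketch
            List<String> targets = new ArrayList<String>();
            for (String t : sides[1].split(",")) targets.add(t.trim());
            // every source term on the left maps to the full target list
            for (String s : sides[0].split(",")) map.put(s.trim(), targets);
        }
        return map;
    }
}
```

A WordnetParser or OpenOfficeThesaurusParser would then be a drop-in alternative producing the same map, and the filter itself would never care where the synonyms came from.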



> Modules under top-level modules/ directory should be included in lucene's 
> build targets, e.g. 'package-tgz', 'package-tgz-src', and 'javadocs'
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2950
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2950
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 4.0
>            Reporter: Steven Rowe
>            Priority: Blocker
>             Fix For: 4.0
>
>
> Lucene's top level {{modules/}} directory is not included in the binary or 
> source release distribution Ant targets {{package-tgz}} and 
> {{package-tgz-src}}, or in {{javadocs}}, in {{lucene/build.xml}}.  (However, 
> these targets do include Lucene contribs.)
> This issue is visible via the nightly Jenkins (formerly Hudson) job named 
> "Lucene-trunk", which publishes binary and source artifacts, using 
> {{package-tgz}} and {{package-tgz-src}}, as well as javadocs using the 
> {{javadocs}} target, all run from the top-level {{lucene/}} directory.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
