[
https://issues.apache.org/jira/browse/LUCENE-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004205#comment-13004205
]
Robert Muir commented on LUCENE-2950:
-------------------------------------
bq. How would this work? E.g. many contribs depend on the common-analyzers
module. Removing this dependency would almost certainly make the contribs
non-functional.
The dependency is mostly bogus. Here are the contribs in question:
* ant
* demo
* lucli
* misc
* spellchecker
* swing
* wordnet
For example the ant IndexTask only depends on this so it can make this hashmap:
{noformat}
static {
  analyzerLookup.put("simple", SimpleAnalyzer.class.getName());
  analyzerLookup.put("standard", StandardAnalyzer.class.getName());
  analyzerLookup.put("stop", StopAnalyzer.class.getName());
  analyzerLookup.put("whitespace", WhitespaceAnalyzer.class.getName());
}
{noformat}
I think we could remove this: the task already has reflection code to build the
analyzer, so if you supply "Xyz", why not just look for XyzAnalyzer as a fallback?
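To make the fallback idea concrete, here is a minimal sketch. It is not the actual IndexTask code: the class name AnalyzerNameResolver, the list of packages to probe, and the use of IllegalArgumentException are all assumptions made for illustration. It only relies on Class.forName from the JDK.

```java
/** Hypothetical helper; not the actual ant IndexTask code. */
final class AnalyzerNameResolver {
    // Assumption: packages to probe would be supplied by the task or its configuration.
    private final String[] packages;

    AnalyzerNameResolver(String... packages) {
        this.packages = packages;
    }

    /** Capitalizes the short name and appends "Analyzer", e.g. "stop" becomes "StopAnalyzer". */
    static String simpleClassName(String shortName) {
        if (shortName == null || shortName.isEmpty()) {
            throw new IllegalArgumentException("analyzer name must not be empty");
        }
        return Character.toUpperCase(shortName.charAt(0)) + shortName.substring(1) + "Analyzer";
    }

    /** Tries the name as a fully qualified class first, then falls back to <package>.<Name>Analyzer. */
    Class<?> resolve(String name) {
        try {
            return Class.forName(name);
        } catch (ClassNotFoundException e) {
            // Not a fully qualified class name; try the naming convention below.
        }
        String simple = simpleClassName(name);
        for (String pkg : packages) {
            try {
                return Class.forName(pkg + "." + simple);
            } catch (ClassNotFoundException e) {
                // Keep probing the remaining packages.
            }
        }
        throw new IllegalArgumentException("No analyzer class found for '" + name + "'");
    }
}
```

With something like this, the four short names in the hashmap above map onto their class names by convention, so the compile-time references to the analyzer classes would no longer be needed.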
The lucli code has 'StandardAnalyzer' as a default: I think it's best to not
have a default analyzer at all. I would have fixed this already, but this
contrib module has no tests! This makes it hard to want to get in there and
clean up.
The misc code mostly supplies an Analyzer inside embedded tools that don't
actually analyze anything. We could add a pkg-private NullAnalyzer that throws
UOE (UnsupportedOperationException) from its tokenStream(). Since these tools
shouldn't be analyzing anything, that seems reasonable to do?
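A rough sketch of what such a NullAnalyzer could look like. So that it compiles without Lucene on the classpath, the Analyzer and TokenStream types below are stand-ins that only mirror the shape of tokenStream(String, Reader); the wrapper class name and the exception message are likewise assumptions.

```java
import java.io.Reader;

/** Illustrative only; the nested Analyzer/TokenStream types are stand-ins, not Lucene's classes. */
final class NullAnalyzerSketch {
    /** Stand-in for Lucene's TokenStream so the sketch is self-contained. */
    interface TokenStream {}

    /** Stand-in mirroring the tokenStream(fieldName, reader) entry point. */
    abstract static class Analyzer {
        abstract TokenStream tokenStream(String fieldName, Reader reader);
    }

    /** Refuses to analyze: any attempt to get a token stream fails loudly. */
    static final class NullAnalyzer extends Analyzer {
        @Override
        TokenStream tokenStream(String fieldName, Reader reader) {
            throw new UnsupportedOperationException(
                "This tool should not analyze text (field '" + fieldName + "')");
        }
    }
}
```

The point of throwing rather than returning an empty stream is that a tool which unexpectedly starts analyzing text fails immediately instead of silently indexing nothing.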
The spellchecker code has a hardcoded WhitespaceAnalyzer... why is this? Seems
like the whole spellchecking n-gramming is wrong anyway. Spellchecker uses a
special form of n-gramming that depends upon the word length. Currently it does
this in Java code and indexes with WhitespaceAnalyzer (creating a lot of
garbage in the process, e.g. lots of Field objects), but it seems this could
all be cleaned up so that the spellchecker uses its own
SpellCheckNgramAnalyzer, for better performance to boot.
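For illustration, word-length-dependent n-gramming can be sketched as below. The thresholds (1-2 grams for short words, 2-3 for five-letter words, 3-4 for longer ones) are illustrative assumptions and should be checked against the spellchecker source; the class and method names are hypothetical, and a real SpellCheckNgramAnalyzer would emit these as tokens rather than build lists.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of n-gramming whose gram sizes depend on the word length. */
final class LengthDependentNGrams {
    // Illustrative thresholds; treat them as assumptions, not the spellchecker's definitive values.
    static int minGram(int wordLength) {
        if (wordLength > 5) return 3;
        if (wordLength == 5) return 2;
        return 1;
    }

    static int maxGram(int wordLength) {
        if (wordLength > 5) return 4;
        if (wordLength == 5) return 3;
        return 2;
    }

    /** All contiguous substrings of length n, in order of position. */
    static List<String> grams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    /** Grams of every size between minGram and maxGram for this word's length. */
    static List<String> allGrams(String word) {
        List<String> out = new ArrayList<>();
        int len = word.length();
        for (int n = minGram(len); n <= maxGram(len); n++) {
            out.addAll(grams(word, n));
        }
        return out;
    }
}
```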
The swing code defaults to a WhitespaceAnalyzer... in my opinion, again, it's best
to not have a default analyzer and instead make the user specify one.
The wordnet code uses StandardAnalyzer for indexing the WordNet database. It
also includes a very limited SynonymTokenFilter. In my opinion, now that we
merged the SynonymFilter from Solr, which supports multi-word synonyms etc.
(which this wordnet module DOES NOT!), we should nuke this whole thing.
Instead, we should make the synonym-loading process more flexible, so that one
can produce the SynonymMap from various formats (such as the existing Solr
format, a relational database, WordNet's format, or the OpenOffice thesaurus
format, among others). We could have parsers for these various formats. This
would allow us to have a much more powerful synonym capability that works
nicely regardless of format. We could then look at other improvements, such as
allowing SynonymFilter to use a more RAM-conscious data structure for its
synonym mappings (e.g. FST), and everyone would see the benefits.
So hopefully this entire contrib could be deprecated.
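As a sketch of the pluggable-parser idea for synonym loading: one small interface, with one implementation per source format. Everything named here (SynonymParser, SolrStyleParser, the plain Map standing in for SynonymMap) is hypothetical. The Solr-style syntax is simplified to comma-separated equivalence groups, "=>" explicit mappings, and "#" comments.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: one parser per synonym source format, all producing the same map. */
final class SynonymParsers {
    /** Assumed interface; a plain Map stands in for SynonymMap here. */
    interface SynonymParser {
        Map<String, Set<String>> parse(List<String> lines);
    }

    /** Handles a simplified Solr-style syntax: "a, b" groups and "a => b, c" mappings. */
    static final class SolrStyleParser implements SynonymParser {
        @Override
        public Map<String, Set<String>> parse(List<String> lines) {
            Map<String, Set<String>> map = new LinkedHashMap<>();
            for (String raw : lines) {
                String line = raw.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // skip blank lines and comments
                }
                int arrow = line.indexOf("=>");
                if (arrow >= 0) {
                    // Explicit mapping: every input term maps to the listed outputs only.
                    List<String> outputs = split(line.substring(arrow + 2));
                    for (String input : split(line.substring(0, arrow))) {
                        add(map, input, outputs);
                    }
                } else {
                    // Equivalence group: every term maps to the whole group, itself included.
                    List<String> group = split(line);
                    for (String term : group) {
                        add(map, term, group);
                    }
                }
            }
            return map;
        }
    }

    private static List<String> split(String csv) {
        List<String> out = new ArrayList<>();
        for (String part : csv.split(",")) {
            String term = part.trim();
            if (!term.isEmpty()) {
                out.add(term);
            }
        }
        return out;
    }

    private static void add(Map<String, Set<String>> map, String key, List<String> values) {
        Set<String> set = map.get(key);
        if (set == null) {
            set = new LinkedHashSet<>();
            map.put(key, set);
        }
        set.addAll(values);
    }
}
```

A WordNet or OpenOffice thesaurus parser would implement the same interface, so the filter consuming the map would not need to know which format the synonyms came from.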
> Modules under top-level modules/ directory should be included in lucene's
> build targets, e.g. 'package-tgz', 'package-tgz-src', and 'javadocs'
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-2950
> URL: https://issues.apache.org/jira/browse/LUCENE-2950
> Project: Lucene - Java
> Issue Type: Bug
> Components: Build
> Affects Versions: 4.0
> Reporter: Steven Rowe
> Priority: Blocker
> Fix For: 4.0
>
>
> Lucene's top level {{modules/}} directory is not included in the binary or
> source release distribution Ant targets {{package-tgz}} and
> {{package-tgz-src}}, or in {{javadocs}}, in {{lucene/build.xml}}. (However,
> these targets do include Lucene contribs.)
> This issue is visible via the nightly Jenkins (formerly Hudson) job named
> "Lucene-trunk", which publishes binary and source artifacts, using
> {{package-tgz}} and {{package-tgz-src}}, as well as javadocs using the
> {{javadocs}} target, all run from the top-level {{lucene/}} directory.