Hi Koorosh,

Lucene's analysis factories (for tokenizers, token filters and char filters) are discovered via Java SPI (see https://docs.oracle.com/javase/tutorial/sound/SPI-intro.html). To make your TokenFilter discoverable, you need to add the fully qualified class name of your factory, on a line of its own, to the file resources/META-INF/services/org.apache.lucene.analysis.util.TokenFilterFactory
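
As a rough illustration of the registration (a sketch, not Lucene's actual code): the services file is a plain-text list with one fully qualified class name per line. As far as I understand AnalysisSPILoader, the lookup name is then the factory's simple class name minus the "TokenFilterFactory" or "FilterFactory" suffix, lower-cased, which would be consistent with the lower-case names listed in your error message. The class SpiRegistrationSketch and its methods are made up for this example; only the factory class name comes from your patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, for illustration only; not part of Lucene.
final class SpiRegistrationSketch {

  // What the extra line in the services file would look like.
  static final String SERVICES_FILE =
      "# one fully qualified factory class name per line\n"
          + "org.apache.lucene.analysis.autophrase.AutoPhrasingTokenFilterFactory\n";

  // Reads provider class names the way a services file is laid out:
  // '#' starts a comment, blank lines are ignored.
  static List<String> providerClassNames(String fileContents) {
    List<String> names = new ArrayList<>();
    for (String line : fileContents.split("\n")) {
      int hash = line.indexOf('#');
      if (hash >= 0) {
        line = line.substring(0, hash);
      }
      line = line.trim();
      if (!line.isEmpty()) {
        names.add(line);
      }
    }
    return names;
  }

  // Assumed naming rule: strip the factory suffix and lower-case the rest.
  // The longer suffix is checked first so "FooTokenFilterFactory" gives "foo".
  static String lookupName(String factoryClassName) {
    String simple = factoryClassName.substring(factoryClassName.lastIndexOf('.') + 1);
    for (String suffix : new String[] {"TokenFilterFactory", "FilterFactory"}) {
      if (simple.endsWith(suffix)) {
        return simple.substring(0, simple.length() - suffix.length())
            .toLowerCase(Locale.ROOT);
      }
    }
    throw new IllegalArgumentException("Not a filter factory name: " + factoryClassName);
  }

  public static void main(String[] args) {
    for (String className : providerClassNames(SERVICES_FILE)) {
      System.out.println(lookupName(className) + " -> " + className);
    }
  }
}
```

Run as is, this prints "autophrasing -> org.apache.lucene.analysis.autophrase.AutoPhrasingTokenFilterFactory". Once the real services file contains that line, "autophrasing" should show up in the list of supported names.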

Alan Woodward
www.flax.co.uk

On 16 Nov 2015, at 23:55, Koorosh Vakhshoori wrote:

> Hi all,
> I am in the process of creating a patch for Lucene. However, I can't get
> the JUnit test TestAllAnalyzersHaveFactories to pass. I hope this is the
> right forum for help. If not, kindly direct me to the correct forum.
> Any help is greatly appreciated!
>
> First, some background. The patch builds on Ted Sullivan's work,
> SOLR-7136. It is an enhanced version of AutoPhrase which I would like to
> submit to the community. The code includes a new TokenFilter,
> AutoPhrasingTokenFilter, with JUnit tests. I have created the following
> package:
>
> org.apache.lucene.analysis.autophrase
>
> This package contains the following class files:
>
> AutoPhraseDetector.java
> AutoPhrasingTokenFilter.java
> AutoPhrasingTokenFilterFactory.java
> package-info.java
>
> When running the test under ant, the test
> TestAllAnalyzersHaveFactories fails with the following output (I have
> added some print statements for debugging):
> ============================================================
> -test:
> [junit4] <JUnit4> says ????! Master seed: 86F1C35C6CE11696
> [junit4] Your default console's encoding may not display certain unicode glyphs: US-ASCII
> [junit4] Executing 1 suite with 1 JVM.
> [junit4]
> [junit4] Started J0 PID(15156@localhost).
> [junit4] Suite: org.apache.lucene.analysis.core.TestAllAnalyzersHaveFactories
> [junit4]   1> clazzName: IndicNormalizationFilter
> [junit4]   1> simpleName: IndicNormalization
> [junit4]   1> clazzName: HyphenationCompoundWordTokenFilter
> [junit4]   1> simpleName: HyphenationCompoundWord
> [junit4]   1> clazzName: DictionaryCompoundWordTokenFilter
> [junit4]   1> simpleName: DictionaryCompoundWord
> [junit4]   1> clazzName: BulgarianStemFilter
> [junit4]   1> simpleName: BulgarianStem
> [junit4]   1> clazzName: ShingleFilter
> [junit4]   1> simpleName: Shingle
> [junit4]   1> clazzName: ReverseStringFilter
> [junit4]   1> simpleName: ReverseString
> [junit4]   1> clazzName: GreekLowerCaseFilter
> [junit4]   1> simpleName: GreekLowerCase
> [junit4]   1> clazzName: GreekStemFilter
> [junit4]   1> simpleName: GreekStem
> [junit4]   1> clazzName: HungarianLightStemFilter
> [junit4]   1> simpleName: HungarianLightStem
> [junit4]   1> clazzName: GermanNormalizationFilter
> [junit4]   1> simpleName: GermanNormalization
> [junit4]   1> clazzName: GermanLightStemFilter
> [junit4]   1> simpleName: GermanLightStem
> [junit4]   1> clazzName: GermanMinimalStemFilter
> [junit4]   1> simpleName: GermanMinimalStem
> [junit4]   1> clazzName: GermanStemFilter
> [junit4]   1> simpleName: GermanStem
> [junit4]   1> clazzName: EnglishPossessiveFilter
> [junit4]   1> simpleName: EnglishPossessive
> [junit4]   1> clazzName: EnglishMinimalStemFilter
> [junit4]   1> simpleName: EnglishMinimalStem
> [junit4]   1> clazzName: PorterStemFilter
> [junit4]   1> simpleName: PorterStem
> [junit4]   1> clazzName: KStemFilter
> [junit4]   1> simpleName: KStem
> [junit4]   1> clazzName: ItalianLightStemFilter
> [junit4]   1> simpleName: ItalianLightStem
> [junit4]   1> clazzName: HindiStemFilter
> [junit4]   1> simpleName: HindiStem
> [junit4]   1> clazzName: HindiNormalizationFilter
> [junit4]   1> simpleName: HindiNormalization
> [junit4]   1> clazzName: RussianLightStemFilter
> [junit4]   1> simpleName: RussianLightStem
> [junit4]   1> clazzName: ClassicFilter
> [junit4]   1> simpleName: Classic
> [junit4]   1> clazzName: StandardFilter
> [junit4]   1> simpleName: Standard
> [junit4]   1> clazzName: CzechStemFilter
> [junit4]   1> simpleName: CzechStem
> [junit4]   1> clazzName: ElisionFilter
> [junit4]   1> simpleName: Elision
> [junit4]   1> clazzName: DelimitedPayloadTokenFilter
> [junit4]   1> simpleName: DelimitedPayload
> [junit4]   1> clazzName: TokenOffsetPayloadTokenFilter
> [junit4]   1> simpleName: TokenOffsetPayload
> [junit4]   1> clazzName: NumericPayloadTokenFilter
> [junit4]   1> simpleName: NumericPayload
> [junit4]   1> clazzName: TypeAsPayloadTokenFilter
> [junit4]   1> simpleName: TypeAsPayload
> [junit4]   1> clazzName: AutoPhrasingTokenFilter
> [junit4]   1> simpleName: AutoPhrasing
> [junit4]   2> NOTE: reproduce with: ant test -Dtestcase=TestAllAnalyzersHaveFactories -Dtests.method=test -Dtests.seed=86F1C35C6CE11696 -Dtests.slow=true -Dtests.locale=zh_CN -Dtests.timezone=US/Samoa -Dtests.asserts=true -Dtests.file.encoding=UTF-8
> [junit4] ERROR 2.94s | TestAllAnalyzersHaveFactories.test <<<
> [junit4]    > Throwable #1: java.lang.IllegalArgumentException: A
> SPI class of type org.apache.lucene.analysis.util.TokenFilterFactory
> with name 'AutoPhrasing' does not exist. You need to add the
> corresponding JAR file supporting this SPI to your classpath. The
> current classpath supports the following names: [apostrophe,
> arabicnormalization, arabicstem, bulgarianstem, brazilianstem,
> cjkbigram, cjkwidth, soraninormalization, soranistem, commongrams,
> commongramsquery, dictionarycompoundword, hyphenationcompoundword,
> decimaldigit, lowercase, stop, type, uppercase, czechstem,
> germanlightstem, germanminimalstem, germannormalization, germanstem,
> greeklowercase, greekstem, englishminimalstem, englishpossessive,
> kstem, porterstem, spanishlightstem, persiannormalization,
> finnishlightstem, frenchlightstem, frenchminimalstem, irishlowercase,
> galicianminimalstem, galicianstem, hindinormalization, hindistem,
> hungarianlightstem, hunspellstem, indonesianstem, indicnormalization,
> italianlightstem, latvianstem, asciifolding, capitalization,
> codepointcount, fingerprint, hyphenatedwords, keepword, keywordmarker,
> keywordrepeat, length, limittokencount, limittokenoffset,
> limittokenposition, removeduplicates, stemmeroverride, trim, truncate,
> worddelimiter, scandinavianfolding, scandinaviannormalization,
> edgengram, ngram, norwegianlightstem, norwegianminimalstem,
> patternreplace, patterncapturegroup, delimitedpayload, numericpayload,
> tokenoffsetpayload, typeaspayload, portugueselightstem,
> portugueseminimalstem, portuguesestem, reversestring,
> russianlightstem, shingle, snowballporter, serbiannormalization,
> classic, standard, swedishlightstem, synonym, turkishlowercase,
> elision]
> [junit4]    > at __randomizedtesting.SeedInfo.seed([86F1C35C6CE11696:EA5FC86C21D7B6E]:0)
> [junit4]    > at org.apache.lucene.analysis.util.AnalysisSPILoader.lookupClass(AnalysisSPILoader.java:135)
> [junit4]    > at org.apache.lucene.analysis.util.TokenFilterFactory.lookupClass(TokenFilterFactory.java:42)
> [junit4]    > at org.apache.lucene.analysis.core.TestAllAnalyzersHaveFactories.test(TestAllAnalyzersHaveFactories.java:168)
> [junit4]    > at java.lang.Thread.run(Thread.java:745)
> [junit4]   2> NOTE: test params are: codec=CheapBastard, sim=ClassicSimilarity, locale=zh_CN, timezone=US/Samoa
> [junit4]   2> NOTE: Linux 2.6.32-358.el6.x86_64 amd64/Oracle Corporation 1.8.0_05 (64-bit)/cpus=4,threads=1,free=136794808,total=160432128
> [junit4]   2> NOTE: All tests run in this JVM: [TestAllAnalyzersHaveFactories]
> [junit4] Completed [1/1] in 4.33s, 1 test, 1 error <<< FAILURES!
> [junit4]
> [junit4]
> [junit4] Tests with failures [seed: 86F1C35C6CE11696]:
> [junit4]   - org.apache.lucene.analysis.core.TestAllAnalyzersHaveFactories.test
> [junit4]
> [junit4]
> [junit4] JVM J0: 0.66 .. 6.09 = 5.44s
> [junit4] Execution time total: 6.11 sec.
> [junit4] Tests summary: 1 suite, 1 test, 1 error
> ================================================
>
> Running the test under the debugger in Eclipse gives the same error
> message for a different factory class, 'DaitchMokitoffSoundex'. This
> may or may not be related to my issue; I am not sure.
>
> My guess is that there is some sort of class loader issue. My
> understanding of the test is that it makes sure there is a
> corresponding TokenFilter factory for each TokenFilter. In this case
> that would be AutoPhrasingTokenFilterFactory. I checked to make sure
> the class is created. The 'find' command shows the class at:
>
> build/analysis/common/classes/java/org/apache/lucene/analysis/autophrase/AutoPhrasingTokenFilterFactory.class
>
> The location is similar to that of the other filter factories.
>
> I have put in print statements and also run the test in the Eclipse
> debugger. As far as I can see, the test code sees the
> AutoPhrasingTokenFilter. Looking at
> TestAllAnalyzersHaveFactories.java, at the line marked with '1>', the
> test code picks up the class AutoPhrasingTokenFilter.
> However, when it gets to the line marked '2>', it fails:
>
> ===========================================
>   public void test() throws Exception {
> 1>  List<Class<?>> analysisClasses =
>         TestRandomChains.getClassesForPackage("org.apache.lucene.analysis");
>
>     ClassLoader cl = ClassLoader.getSystemClassLoader();
>
>     URL[] urls = ((URLClassLoader)cl).getURLs();
>     // System.out.println("ClassPath Start:");
>     for(URL url: urls){
>       // System.out.println(url.getFile());
>     }
>     // System.out.println("ClassPath Ends!");
>
>     for (final Class<?> c : analysisClasses) {
>       final int modifiers = c.getModifiers();
>       if (
>         // don't waste time with abstract classes
>         Modifier.isAbstract(modifiers) || !Modifier.isPublic(modifiers)
>         || c.isSynthetic() || c.isAnonymousClass() || c.isMemberClass() || c.isInterface()
>         || testComponents.contains(c)
>         || crazyComponents.contains(c)
>         || oddlyNamedComponents.contains(c)
>         || c.isAnnotationPresent(Deprecated.class) // deprecated ones are typically back compat hacks
>         || !(Tokenizer.class.isAssignableFrom(c) || TokenFilter.class.isAssignableFrom(c) || CharFilter.class.isAssignableFrom(c))
>       ) {
>         continue;
>       }
>
>       Map<String,String> args = new HashMap<>();
>       args.put("luceneMatchVersion", Version.LATEST.toString());
>
>       if (Tokenizer.class.isAssignableFrom(c)) {
>         String clazzName = c.getSimpleName();
>         assertTrue(clazzName.endsWith("Tokenizer"));
>         String simpleName = clazzName.substring(0, clazzName.length() - 9);
>         assertNotNull(TokenizerFactory.lookupClass(simpleName));
>         TokenizerFactory instance = null;
>         try {
>           instance = TokenizerFactory.forName(simpleName, args);
>           assertNotNull(instance);
>           if (instance instanceof ResourceLoaderAware) {
>             ((ResourceLoaderAware) instance).inform(loader);
>           }
>           assertSame(c, instance.create().getClass());
>         } catch (IllegalArgumentException e) {
>           if (e.getCause() instanceof NoSuchMethodException) {
>             // there is no corresponding ctor available
>             throw e;
>           }
>           // TODO: For now pass because some factories have not yet a default config that always works
>         }
>       } else if (TokenFilter.class.isAssignableFrom(c)) {
>         String clazzName = c.getSimpleName();
>         System.out.println("clazzName: " + clazzName);
>         assertTrue(clazzName.endsWith("Filter"));
>         String simpleName = clazzName.substring(0, clazzName.length()
>             - (clazzName.endsWith("TokenFilter") ? 11 : 6));
>         System.out.println("simpleName: " + simpleName);
> 2>      assertNotNull(TokenFilterFactory.lookupClass(simpleName));
> =====================================================
>
> Here is the code for the factory class:
>
> package org.apache.lucene.analysis.autophrase;
>
> /*
>  * Copyright 2015 Synopsys, Inc.
>  *
>  * Licensed under the Apache License, Version 2.0 (the "License"); you
>  * may not use this file except in compliance with the License. You may
>  * obtain a copy of the License at
>  *
>  * http://www.apache.org/licenses/LICENSE-2.0
>  *
>  * Unless required by applicable law or agreed to in writing, software
>  * distributed under the License is distributed on an "AS IS" BASIS,
>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>  * See the License for the specific language governing permissions and
>  * limitations under the License.
>  */
>
> import java.io.IOException;
> import java.util.Map;
>
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.util.CharArraySet;
> import org.apache.lucene.analysis.util.ResourceLoader;
> import org.apache.lucene.analysis.util.ResourceLoaderAware;
> import org.apache.lucene.analysis.util.TokenFilterFactory;
>
> public class AutoPhrasingTokenFilterFactory extends TokenFilterFactory
>     implements ResourceLoaderAware {
>
>   private CharArraySet phraseSets;
>   private final String phraseSetFiles;
>   private final boolean ignoreCase;
>   private final boolean emitSingleTokens;
>   private final boolean quotePhrase;
>   private final boolean emitAmbiguousPhrases;
>
>   private String replaceWhitespaceWith = null;
>
>   public AutoPhrasingTokenFilterFactory(Map<String, String> initArgs) {
>     super( initArgs );
>     phraseSetFiles = get(initArgs, "phrases");
>     ignoreCase = getBoolean( initArgs, "ignoreCase", false);
>     emitSingleTokens = getBoolean( initArgs, "includeTokens", false );
>     quotePhrase = getBoolean( initArgs, "quotePhrase", false );
>     emitAmbiguousPhrases = getBoolean( initArgs, "emitAmbiguousPhrases", false );
>
>     String replaceWhitespaceArg = initArgs.get( "replaceWhitespaceWith" );
>     if (replaceWhitespaceArg != null) {
>       replaceWhitespaceWith = replaceWhitespaceArg;
>     }
>   }
>
>   @Override
>   public void inform(ResourceLoader loader) throws IOException {
>     if (phraseSetFiles != null) {
>       phraseSets = getWordSet(loader, phraseSetFiles, ignoreCase);
>     }
>   }
>
>   @Override
>   public TokenStream create( TokenStream input ) {
>     AutoPhrasingTokenFilter autoPhraseFilter =
>         new AutoPhrasingTokenFilter( input, phraseSets, emitSingleTokens );
>     if (replaceWhitespaceWith != null) {
>       autoPhraseFilter.setReplaceWhitespaceWith( new Character( replaceWhitespaceWith.charAt( 0 )) );
>     }
>     // Doesn't make sense to emit phrases in double quotes if replaceWhitespaceWith character is set.
>     if ((replaceWhitespaceWith == null) && quotePhrase) {
>       autoPhraseFilter.setQuotePhrase(quotePhrase);
>     }
>     if (emitAmbiguousPhrases) {
>       autoPhraseFilter.setEmitAmbiguousPhrases(emitAmbiguousPhrases);
>     }
>     return autoPhraseFilter;
>   }
> }
>
> Thanks,
>
> Koorosh