Hi Koorosh,

Lucene's analysis factories (for tokenizers, token filters and char 
filters) are discovered via Java SPI (see 
https://docs.oracle.com/javase/tutorial/sound/SPI-intro.html).  In order to 
make your TokenFilter discoverable, you need to add the fully qualified 
class name of your factory to the file 
resources/META-INF/services/org.apache.lucene.analysis.util.TokenFilterFactory
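
As a rough sketch for your patch: assuming the factory lives in the 
analysis/common module (the source path here is my assumption; the class 
name is taken from your mail), the entry would be one extra line in 
lucene/analysis/common/src/resources/META-INF/services/org.apache.lucene.analysis.util.TokenFilterFactory:

```text
org.apache.lucene.analysis.autophrase.AutoPhrasingTokenFilterFactory
```

As far as I know, the loader derives the lookup name from the factory's 
simple class name with the "TokenFilterFactory" suffix stripped, compared 
case-insensitively, so this entry should satisfy the test's lookup of 
'AutoPhrasing'. You will probably need to rebuild so that the resource file 
is copied next to the compiled classes.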

Alan Woodward
www.flax.co.uk


On 16 Nov 2015, at 23:55, Koorosh Vakhshoori wrote:

> Hi all,
>  I am in the process of creating a patch for Lucene. However, I can’t get
> the JUnit test TestAllAnalyzersHaveFactories to pass. I hope this is the
> right forum for help; if not, kindly direct me to the correct one.
> Any help is greatly appreciated!
> 
>  First, some background. The patch builds on Ted Sullivan's work,
> SOLR-7136. It is an enhanced version of AutoPhrase which I would like to
> submit to the community. The code includes a new TokenFilter,
> AutoPhrasingTokenFilter, with JUnit tests. I have created the following
> package:
> 
> org.apache.lucene.analysis.autophrase
> 
> This package contains the following class files:
> 
> AutoPhraseDetector.java
> AutoPhrasingTokenFilter.java
> AutoPhrasingTokenFilterFactory.java
> package-info.java
> 
> When running the tests under ant, TestAllAnalyzersHaveFactories
> fails with the following output (I have added some print statements
> for debugging):
> ============================================================
> -test:
>   [junit4] <JUnit4> says ????! Master seed: 86F1C35C6CE11696
>   [junit4] Your default console's encoding may not display certain
> unicode glyphs: US-ASCII
>   [junit4] Executing 1 suite with 1 JVM.
>   [junit4]
>   [junit4] Started J0 PID(15156@localhost).
>   [junit4] Suite: 
> org.apache.lucene.analysis.core.TestAllAnalyzersHaveFactories
>   [junit4]   1> clazzName: IndicNormalizationFilter
>   [junit4]   1> simpleName: IndicNormalization
>   [junit4]   1> clazzName: HyphenationCompoundWordTokenFilter
>   [junit4]   1> simpleName: HyphenationCompoundWord
>   [junit4]   1> clazzName: DictionaryCompoundWordTokenFilter
>   [junit4]   1> simpleName: DictionaryCompoundWord
>   [junit4]   1> clazzName: BulgarianStemFilter
>   [junit4]   1> simpleName: BulgarianStem
>   [junit4]   1> clazzName: ShingleFilter
>   [junit4]   1> simpleName: Shingle
>   [junit4]   1> clazzName: ReverseStringFilter
>   [junit4]   1> simpleName: ReverseString
>   [junit4]   1> clazzName: GreekLowerCaseFilter
>   [junit4]   1> simpleName: GreekLowerCase
>   [junit4]   1> clazzName: GreekStemFilter
>   [junit4]   1> simpleName: GreekStem
>   [junit4]   1> clazzName: HungarianLightStemFilter
>   [junit4]   1> simpleName: HungarianLightStem
>   [junit4]   1> clazzName: GermanNormalizationFilter
>   [junit4]   1> simpleName: GermanNormalization
>   [junit4]   1> clazzName: GermanLightStemFilter
>   [junit4]   1> simpleName: GermanLightStem
>   [junit4]   1> clazzName: GermanMinimalStemFilter
>   [junit4]   1> simpleName: GermanMinimalStem
>   [junit4]   1> clazzName: GermanStemFilter
>   [junit4]   1> simpleName: GermanStem
>   [junit4]   1> clazzName: EnglishPossessiveFilter
>   [junit4]   1> simpleName: EnglishPossessive
>   [junit4]   1> clazzName: EnglishMinimalStemFilter
>   [junit4]   1> simpleName: EnglishMinimalStem
>   [junit4]   1> clazzName: PorterStemFilter
>   [junit4]   1> simpleName: PorterStem
>   [junit4]   1> clazzName: KStemFilter
>   [junit4]   1> simpleName: KStem
>   [junit4]   1> clazzName: ItalianLightStemFilter
>   [junit4]   1> simpleName: ItalianLightStem
>   [junit4]   1> clazzName: HindiStemFilter
>   [junit4]   1> simpleName: HindiStem
>   [junit4]   1> clazzName: HindiNormalizationFilter
>   [junit4]   1> simpleName: HindiNormalization
>   [junit4]   1> clazzName: RussianLightStemFilter
>   [junit4]   1> simpleName: RussianLightStem
>   [junit4]   1> clazzName: ClassicFilter
>   [junit4]   1> simpleName: Classic
>   [junit4]   1> clazzName: StandardFilter
>   [junit4]   1> simpleName: Standard
>   [junit4]   1> clazzName: CzechStemFilter
>   [junit4]   1> simpleName: CzechStem
>   [junit4]   1> clazzName: ElisionFilter
>   [junit4]   1> simpleName: Elision
>   [junit4]   1> clazzName: DelimitedPayloadTokenFilter
>   [junit4]   1> simpleName: DelimitedPayload
>   [junit4]   1> clazzName: TokenOffsetPayloadTokenFilter
>   [junit4]   1> simpleName: TokenOffsetPayload
>   [junit4]   1> clazzName: NumericPayloadTokenFilter
>   [junit4]   1> simpleName: NumericPayload
>   [junit4]   1> clazzName: TypeAsPayloadTokenFilter
>   [junit4]   1> simpleName: TypeAsPayload
>   [junit4]   1> clazzName: AutoPhrasingTokenFilter
>   [junit4]   1> simpleName: AutoPhrasing
>   [junit4]   2> NOTE: reproduce with: ant test
> -Dtestcase=TestAllAnalyzersHaveFactories -Dtests.method=test
> -Dtests.seed=86F1C35C6CE11696 -Dtests.slow=true -Dtests.locale=zh_CN
> -Dtests.timezone=US/Samoa -Dtests.asserts=true
> -Dtests.file.encoding=UTF-8
>   [junit4] ERROR   2.94s | TestAllAnalyzersHaveFactories.test <<<
>   [junit4]    > Throwable #1: java.lang.IllegalArgumentException: A
> SPI class of type org.apache.lucene.analysis.util.TokenFilterFactory
> with name 'AutoPhrasing' does not exist. You need to add the
> corresponding JAR file supporting this SPI to your classpath. The
> current classpath supports the following names: [apostrophe,
> arabicnormalization, arabicstem, bulgarianstem, brazilianstem,
> cjkbigram, cjkwidth, soraninormalization, soranistem, commongrams,
> commongramsquery, dictionarycompoundword, hyphenationcompoundword,
> decimaldigit, lowercase, stop, type, uppercase, czechstem,
> germanlightstem, germanminimalstem, germannormalization, germanstem,
> greeklowercase, greekstem, englishminimalstem, englishpossessive,
> kstem, porterstem, spanishlightstem, persiannormalization,
> finnishlightstem, frenchlightstem, frenchminimalstem, irishlowercase,
> galicianminimalstem, galicianstem, hindinormalization, hindistem,
> hungarianlightstem, hunspellstem, indonesianstem, indicnormalization,
> italianlightstem, latvianstem, asciifolding, capitalization,
> codepointcount, fingerprint, hyphenatedwords, keepword, keywordmarker,
> keywordrepeat, length, limittokencount, limittokenoffset,
> limittokenposition, removeduplicates, stemmeroverride, trim, truncate,
> worddelimiter, scandinavianfolding, scandinaviannormalization,
> edgengram, ngram, norwegianlightstem, norwegianminimalstem,
> patternreplace, patterncapturegroup, delimitedpayload, numericpayload,
> tokenoffsetpayload, typeaspayload, portugueselightstem,
> portugueseminimalstem, portuguesestem, reversestring,
> russianlightstem, shingle, snowballporter, serbiannormalization,
> classic, standard, swedishlightstem, synonym, turkishlowercase,
> elision]
>   [junit4]    >        at
> __randomizedtesting.SeedInfo.seed([86F1C35C6CE11696:EA5FC86C21D7B6E]:0)
>   [junit4]    >        at
> org.apache.lucene.analysis.util.AnalysisSPILoader.lookupClass(AnalysisSPILoader.java:135)
>   [junit4]    >        at
> org.apache.lucene.analysis.util.TokenFilterFactory.lookupClass(TokenFilterFactory.java:42)
>   [junit4]    >        at
> org.apache.lucene.analysis.core.TestAllAnalyzersHaveFactories.test(TestAllAnalyzersHaveFactories.java:168)
>   [junit4]    >        at java.lang.Thread.run(Thread.java:745)
>   [junit4]   2> NOTE: test params are: codec=CheapBastard,
> sim=ClassicSimilarity, locale=zh_CN, timezone=US/Samoa
>   [junit4]   2> NOTE: Linux 2.6.32-358.el6.x86_64 amd64/Oracle
> Corporation 1.8.0_05
> (64-bit)/cpus=4,threads=1,free=136794808,total=160432128
>   [junit4]   2> NOTE: All tests run in this JVM:
> [TestAllAnalyzersHaveFactories]
>   [junit4] Completed [1/1] in 4.33s, 1 test, 1 error <<< FAILURES!
>   [junit4]
>   [junit4]
>   [junit4] Tests with failures [seed: 86F1C35C6CE11696]:
>   [junit4]   -
> org.apache.lucene.analysis.core.TestAllAnalyzersHaveFactories.test
>   [junit4]
>   [junit4]
>   [junit4] JVM J0:     0.66 ..     6.09 =     5.44s
>   [junit4] Execution time total: 6.11 sec.
>   [junit4] Tests summary: 1 suite, 1 test, 1 error
> ================================================
> 
> Running the test under the debugger in Eclipse gives the same error
> message for a different factory class, 'DaitchMokotoffSoundex'. This
> may or may not be related to my issue; I am not sure.
> 
> My guess is that there is some sort of class loader issue. My
> understanding of the test is that it makes sure there is a
> corresponding TokenFilterFactory for every TokenFilter. In this case
> that would be AutoPhrasingTokenFilterFactory. I checked that the class
> has been built. The 'find' command shows the class file at:
> 
> build/analysis/common/classes/java/org/apache/lucene/analysis/autophrase/AutoPhrasingTokenFilterFactory.class
> 
> The location is similar to other Filter factories.
> 
> I have added print statements and also run the test in the Eclipse
> debugger. As far as I can see, the test code sees the
> AutoPhrasingTokenFilter. Looking at
> TestAllAnalyzersHaveFactories.java, at the line marked with '1>', the
> test code picks up the class AutoPhrasingTokenFilter. However, when it
> gets to the line marked with '2>', it fails:
> 
> ===========================================
>  public void test() throws Exception {
> 1>    List<Class<?>> analysisClasses =
> TestRandomChains.getClassesForPackage("org.apache.lucene.analysis");
> 
>    ClassLoader cl = ClassLoader.getSystemClassLoader();
> 
>    URL[] urls = ((URLClassLoader)cl).getURLs();
> //    System.out.println("ClassPath Start:");
>    for(URL url: urls){
> //      System.out.println(url.getFile());
>    }
> //    System.out.println("ClassPath Ends!");
> 
>    for (final Class<?> c : analysisClasses) {
>      final int modifiers = c.getModifiers();
>      if (
>        // don't waste time with abstract classes
>        Modifier.isAbstract(modifiers) || !Modifier.isPublic(modifiers)
>        || c.isSynthetic() || c.isAnonymousClass() ||
> c.isMemberClass() || c.isInterface()
>        || testComponents.contains(c)
>        || crazyComponents.contains(c)
>        || oddlyNamedComponents.contains(c)
>        // deprecated ones are typically back compat hacks
>        || c.isAnnotationPresent(Deprecated.class)
>        || !(Tokenizer.class.isAssignableFrom(c) ||
> TokenFilter.class.isAssignableFrom(c) ||
> CharFilter.class.isAssignableFrom(c))
>      ) {
>        continue;
>      }
> 
>      Map<String,String> args = new HashMap<>();
>      args.put("luceneMatchVersion", Version.LATEST.toString());
> 
>      if (Tokenizer.class.isAssignableFrom(c)) {
>        String clazzName = c.getSimpleName();
>        assertTrue(clazzName.endsWith("Tokenizer"));
>        String simpleName = clazzName.substring(0, clazzName.length() - 9);
>        assertNotNull(TokenizerFactory.lookupClass(simpleName));
>        TokenizerFactory instance = null;
>        try {
>          instance = TokenizerFactory.forName(simpleName, args);
>          assertNotNull(instance);
>          if (instance instanceof ResourceLoaderAware) {
>            ((ResourceLoaderAware) instance).inform(loader);
>          }
>          assertSame(c, instance.create().getClass());
>        } catch (IllegalArgumentException e) {
>          if (e.getCause() instanceof NoSuchMethodException) {
>            // there is no corresponding ctor available
>            throw e;
>          }
>          // TODO: For now pass because some factories have not yet a
>          // default config that always works
>        }
>      } else if (TokenFilter.class.isAssignableFrom(c)) {
>        String clazzName = c.getSimpleName();
>        System.out.println("clazzName: " + clazzName);
>        assertTrue(clazzName.endsWith("Filter"));
>        String simpleName = clazzName.substring(0, clazzName.length()
> - (clazzName.endsWith("TokenFilter") ? 11 : 6));
>        System.out.println("simpleName: " + simpleName);
> 2>        assertNotNull(TokenFilterFactory.lookupClass(simpleName));
> =====================================================
> 
> Here is the code for the factory class:
> 
> package org.apache.lucene.analysis.autophrase;
> 
> /*
> * Copyright 2015 Synopsys, Inc.
> *
> * Licensed under the Apache License, Version 2.0 (the "License"); you
> * may not use this file except in compliance with the License. You may
> * obtain a copy of the License at
> *
> *     http://www.apache.org/licenses/LICENSE-2.0
> *
> * Unless required by applicable law or agreed to in writing, software
> * distributed under the License is distributed on an "AS IS" BASIS,
> * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> * See the License for the specific language governing permissions and
> * limitations under the License.
> */
> 
> import java.io.IOException;
> import java.util.Map;
> 
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.util.CharArraySet;
> import org.apache.lucene.analysis.util.ResourceLoader;
> import org.apache.lucene.analysis.util.ResourceLoaderAware;
> import org.apache.lucene.analysis.util.TokenFilterFactory;
> 
> public class AutoPhrasingTokenFilterFactory extends TokenFilterFactory
> implements ResourceLoaderAware {
> 
>  private CharArraySet phraseSets;
>  private final String phraseSetFiles;
>  private final boolean ignoreCase;
>  private final boolean emitSingleTokens;
>  private final boolean quotePhrase;
>  private final boolean emitAmbiguousPhrases;
> 
>  private String replaceWhitespaceWith = null;
> 
>  public AutoPhrasingTokenFilterFactory(Map<String, String> initArgs) {
>    super( initArgs );
>    phraseSetFiles = get(initArgs, "phrases");
>    ignoreCase = getBoolean( initArgs, "ignoreCase", false);
>    emitSingleTokens = getBoolean( initArgs, "includeTokens", false );
>    quotePhrase = getBoolean( initArgs, "quotePhrase", false );
>    emitAmbiguousPhrases = getBoolean( initArgs,
> "emitAmbiguousPhrases", false );
> 
>    String replaceWhitespaceArg = initArgs.get( "replaceWhitespaceWith" );
>    if (replaceWhitespaceArg != null) {
>      replaceWhitespaceWith = replaceWhitespaceArg;
>    }
>  }
> 
>  @Override
>  public void inform(ResourceLoader loader) throws IOException {
>    if (phraseSetFiles != null) {
>      phraseSets = getWordSet(loader, phraseSetFiles, ignoreCase);
>    }
>  }
> 
>  @Override
>  public TokenStream create( TokenStream input ) {
>    AutoPhrasingTokenFilter autoPhraseFilter = new
> AutoPhrasingTokenFilter( input, phraseSets, emitSingleTokens );
>    if (replaceWhitespaceWith != null) {
>      autoPhraseFilter.setReplaceWhitespaceWith( new Character(
> replaceWhitespaceWith.charAt( 0 )) );
>    }
>    // It doesn't make sense to emit phrases in double quotes if the
>    // replaceWhitespaceWith character is set.
>    if ((replaceWhitespaceWith == null) && quotePhrase) {
>      autoPhraseFilter.setQuotePhrase(quotePhrase);
>    }
>    if (emitAmbiguousPhrases) {
>        autoPhraseFilter.setEmitAmbiguousPhrases(emitAmbiguousPhrases);
>    }
>    return autoPhraseFilter;
>  }
> }
> 
> Thanks,
> 
> Koorosh
> 
> 
