In the provider's get() method, I would rather always return a new analyzer instance.
The configuration and setup of the analyzer could be refactored into the provider.

Jörg

On Wed, Mar 18, 2015 at 8:12 PM, Dmitry Kan <[email protected]> wrote:

> Yes, I use an analyzer provider. Here is the code:
>
> [code]
>
> public class SemanticAnalyzerTwitterLemmatizerProvider extends
>         AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {
>     private final RussianLemmatizingTwitterAnalyzer russianLemmatizingGenericAnalyzer;
>
>     private final Logger Log =
>             Logger.getLogger(SemanticAnalyzerTwitterLemmatizerProvider.class.getName());
>
>     @Inject
>     public SemanticAnalyzerTwitterLemmatizerProvider(Index index,
>                                                      @IndexSettings Settings indexSettings,
>                                                      @Assisted String name,
>                                                      Settings settings) {
>         super(index, indexSettings, name, settings);
>         Log.info("called super with name=" + name);
>         try {
>             String lemmatizerConfFile = settings.get("lemmatizerConf");
>             boolean analyzeBest = Boolean.parseBoolean(settings.get("analyzeBest"));
>             russianLemmatizingGenericAnalyzer =
>                     new RussianLemmatizingTwitterAnalyzer(lemmatizerConfFile, analyzeBest);
>         } catch (IOException ioe) {
>             throw new ElasticsearchIllegalArgumentException(
>                     "Unable to load Russian morphology analyzer", ioe);
>         } catch (Exception e) {
>             throw new ElasticsearchIllegalArgumentException(
>                     "Unable to load Russian morphology analyzer", e);
>         }
>     }
>
>     @Override
>     public RussianLemmatizingTwitterAnalyzer get() {
>         return russianLemmatizingGenericAnalyzer;
>     }
> }
>
> [/code]
>
> Would you recommend using your approach instead of this one? Do you spot
> any issues in my implementation of the provider?
>
> On Wednesday, 18 March 2015 20:52:08 UTC+2, Jörg Prante wrote:
>>
>> Do you use an analyzer provider?
>>
>> Example
>>
>> public class RussianLemmatizingTwitterAnalyzerProvider extends
>>         AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {
>>
>>     private final MorphAnalyzer morphAnalyzer;
>>
>>     ...
>>
>>     @Inject
>>     public RussianLemmatizingTwitterAnalyzerProvider(Index index,
>>                                                      @IndexSettings Settings indexSettings,
>>                                                      Environment environment,
>>                                                      @Assisted String name,
>>                                                      @Assisted Settings settings) {
>>         super(index, indexSettings, name, settings);
>>         this.morphAnalyzer = createMorphAnalyzer(environment, settings, ...);
>>     }
>>
>>     @Override
>>     public RussianLemmatizingTwitterAnalyzer get() {
>>         return new RussianLemmatizingTwitterAnalyzer(morphAnalyzer, ...);
>>     }
>>
>>     private MorphAnalyzer createMorphAnalyzer(...) {
>>     }
>>
>> }
>>
>> Only such a provider is bound to a singleton. So the analyzer provider
>> can set up the analyzer configuration exactly once (with a MorphAnalyzer
>> instance etc.), and with the get() method, it creates analyzers as required.
>>
>> Jörg
>>
>> On Wed, Mar 18, 2015 at 6:40 PM, Dmitry Kan <[email protected]> wrote:
>>
>>> Jörg,
>>>
>>> Thanks for replying!
>>>
>>> Here is the code of the RussianLemmatizingTwitterAnalyzer, the deepest
>>> class in the simplified class sequence I have posted in the original
>>> message.
>>>
>>> [code]
>>>
>>> public class RussianLemmatizingTwitterAnalyzer extends Analyzer {
>>>
>>>     private static MorphAnalyzer morphAnalyzerGlobal;
>>>
>>>     boolean useSyncMethod = true;
>>>
>>>     private static final boolean verbose = false;
>>>     private MorphAnalyzer morphAnalyzer;
>>>     private boolean analyzeBest = false;
>>>
>>>     private static final Logger Log =
>>>             Logger.getLogger(RussianLemmatizingTwitterAnalyzer.class.getName());
>>>
>>>     public RussianLemmatizingTwitterAnalyzer(String lemmatizerConfFile,
>>>                                              boolean analyzeBest) throws IOException {
>>>         this.analyzeBest = analyzeBest;
>>>
>>>         if (useSyncMethod) {
>>>             this.morphAnalyzer = loadCustomAnalyzer(lemmatizerConfFile);
>>>         } else {
>>>             Properties properties = new Properties();
>>>
>>>             Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>>>
>>>             properties.load(new StringReader(IOUtils.readFile(
>>>                     new File(lemmatizerConfFile), Charsets.UTF_8)));
>>>             this.morphAnalyzer = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
>>>         }
>>>     }
>>>
>>>     private static MorphAnalyzer loadAnalyzer(String lemmatizerConfFile)
>>>             throws IOException {
>>>         Properties properties = new Properties();
>>>
>>>         Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>>>
>>>         properties.load(new StringReader(IOUtils.readFile(
>>>                 new File(lemmatizerConfFile), Charsets.UTF_8)));
>>>         MorphAnalyzer morphAnalyzer1 =
>>>                 MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
>>>
>>>         if (verbose) {
>>>             if (morphAnalyzer1 != null) {
>>>                 Log.info("Successfully created the analyzer!");
>>>                 Log.info(morphAnalyzer1.analyzeBest("билета").toString());
>>>             } else {
>>>                 Log.severe("Failed to create the morphAnalyzer object");
>>>             }
>>>         }
>>>
>>>         return morphAnalyzer1;
>>>     }
>>>
>>>     public static synchronized MorphAnalyzer loadCustomAnalyzer(String lemmatizerConfFile)
>>>             throws IOException {
>>>         if (morphAnalyzerGlobal == null) {
>>>             morphAnalyzerGlobal = loadAnalyzer(lemmatizerConfFile);
>>>         }
>>>
>>>         return morphAnalyzerGlobal;
>>>     }
>>>
>>>     @Override
>>>     protected TokenStreamComponents createComponents(String fieldName, final Reader reader) {
>>>         Tokenizer tokenizer = new TwitterFlexLuceneTokenizer(reader);
>>>
>>>         Log.config("Using Tokenizer: " + tokenizer.getClass().getSimpleName());
>>>
>>>         TokenStream tokenStream = tokenizer;
>>>         tokenStream = new LowerCaseFilter(Version.LUCENE_4_9, tokenStream);
>>>         tokenStream = new MorphAnalTokenFilter(tokenStream, morphAnalyzer, analyzeBest);
>>>         return new TokenStreamComponents(tokenizer, tokenStream);
>>>     }
>>>
>>> }
>>>
>>> [/code]
>>>
>>> Note that in the code above the TwitterFlexLuceneTokenizer is not
>>> thread safe and extends o.a.lucene.analysis.Tokenizer. In jvisualvm there
>>> are 97 instances of this class.
>>>
>>> Let me know if I should copy other code snippets up the class stream.
>>>
>>> Dmitry
>>>
>>> On Wednesday, 18 March 2015 18:22:47 UTC+2, Jörg Prante wrote:
>>>>
>>>> Is it possible to examine the code of your plugin?
>>>>
>>>> Generally speaking, analyzers are instantiated per index creation for
>>>> each thread.
>>>>
>>>> In org.elasticsearch.index.analysis.AnalysisModule, you can see how
>>>> analyzer providers and factories are prepared for injection with the help of
>>>> the ES injection module, which is based on Guice. Basically, the factories
>>>> are kept as singletons, and each thread can pick analyzer instances from
>>>> the factory when needed. All in all, Lucene analyzer classes are not
>>>> threadsafe, in particular the tokenizers. This means it is up to the
>>>> implementor of an analyzer/tokenizer to store immutable objects as
>>>> singletons in a correct way so that all threads can safely access them.
>>>>
>>>> Jörg
>>>>
>>>> On Wed, Mar 18, 2015 at 4:02 PM, Dmitry Kan <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Could somebody answer, please?
>>>>>
>>>>> On Tuesday, 17 March 2015 19:05:38 UTC+2, Dmitry Kan wrote:
>>>>>>
>>>>>> Hello!
>>>>>>
>>>>>> I'm a newbie in elasticsearch, so forgive me if the question is lame.
>>>>>>
>>>>>> I have implemented a custom plugin using a custom lemmatizer and a
>>>>>> tokenizer. The simplified class sequence:
>>>>>>
>>>>>> AnalysisMorphologyPlugin->MorphologyAnalysisBinderProcessor->SemanticAnalyzerTwitterLemmatizerProvider->RussianLemmatizingTwitterAnalyzer
>>>>>>
>>>>>> In the RussianLemmatizingTwitterAnalyzer's ctor I load the custom object
>>>>>> for lemmatization (object unrelated to lucene/es) in a singleton fashion
>>>>>> (in a synchronized code block).
>>>>>> Then, when creating 14 indices in the same JVM I see
>>>>>> 14 instances of RussianLemmatizingTwitterAnalyzer,
>>>>>> 4 instances of SemanticAnalyzerTwitterLemmatizerProvider,
>>>>>> 4 instances of MorphologyAnalysisBinderProcessor,
>>>>>> 30 instances of the custom lemmatizer (in each
>>>>>> RussianLemmatizingTwitterAnalyzer only one instance is expected, so
>>>>>> should be 14),
>>>>>> 1 instance of AnalysisMorphologyPlugin.
>>>>>>
>>>>>> The question is, can the RussianLemmatizingTwitterAnalyzer object be
>>>>>> shared between indices? Or is it by design that they must load
>>>>>> separately per index?
>>>>>> What could be wrong in the code that makes 30 instances of the custom
>>>>>> singleton lemmatizer instead of 14?
>>>>>>
>>>>>> The current standing is that *with* the plugin 100M of RAM is reserved
>>>>>> by the JVM with no data. *Without* the plugin the JVM reserves 2M with
>>>>>> no data. Elasticsearch 1.3.2, Lucene 4.9.0.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Dmitry Kan
>>>>>>
>>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "elasticsearch" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/elasticsearch/c2c57184-ee3b-4600-9091-a515b496b867%40googlegroups.com.
>>>>>
>>>>> For more options, visit https://groups.google.com/d/optout.
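
The pattern suggested in the thread (load the expensive resource once in the singleton provider, then return a fresh analyzer from every get() call) can be sketched without any Elasticsearch or Lucene dependency. This is only an illustration under stated assumptions: SharedLemmatizer, SketchAnalyzer and SketchAnalyzerProvider are hypothetical stand-ins for MorphAnalyzer, RussianLemmatizingTwitterAnalyzer and the analyzer provider, and the sketch assumes the shared lemmatizer is immutable and therefore safe to hand to several analyzers.

```java
import java.util.Locale;

// Hypothetical stand-in for the expensive lemmatizer (MorphAnalyzer in the thread).
// Assumption: it is immutable after construction, so sharing it across threads is safe.
final class SharedLemmatizer {
    private final String confFile;

    SharedLemmatizer(String confFile) {
        this.confFile = confFile; // a real implementation would load dictionaries here
    }

    String lemmatize(String token) {
        return token.toLowerCase(Locale.ROOT); // placeholder for real morphology
    }
}

// Hypothetical stand-in for the analyzer: cheap to create, holds only references.
final class SketchAnalyzer {
    private final SharedLemmatizer lemmatizer;
    private final boolean analyzeBest;

    SketchAnalyzer(SharedLemmatizer lemmatizer, boolean analyzeBest) {
        this.lemmatizer = lemmatizer;
        this.analyzeBest = analyzeBest;
    }

    SharedLemmatizer lemmatizer() {
        return lemmatizer;
    }

    boolean analyzeBest() {
        return analyzeBest;
    }

    String analyze(String token) {
        return lemmatizer.lemmatize(token);
    }
}

// Hypothetical stand-in for the provider: configuration happens once in the
// constructor, and get() hands out a new analyzer that reuses the shared lemmatizer.
final class SketchAnalyzerProvider {
    private final SharedLemmatizer lemmatizer;
    private final boolean analyzeBest;

    SketchAnalyzerProvider(String confFile, boolean analyzeBest) {
        this.lemmatizer = new SharedLemmatizer(confFile); // loaded exactly once per provider
        this.analyzeBest = analyzeBest;
    }

    SketchAnalyzer get() {
        return new SketchAnalyzer(lemmatizer, analyzeBest); // new analyzer on every call
    }
}
```

With this shape, two calls to get() on the same provider yield two distinct analyzer objects that point at one lemmatizer instance, so the number of lemmatizers follows the number of providers and not the number of analyzers.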
