Re: issue with singleton analyzer in single JVM multi-index setup

[email protected] Wed, 18 Mar 2015 12:28:11 -0700

In the get() method of the provider, I would always return a new analyzer
instance.


The configuration and setup of the analyzer could be moved into the
provider.
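
A minimal sketch of that split, in plain Java with no Elasticsearch or Lucene
dependencies. SharedLemmatizer, SketchAnalyzer and SketchAnalyzerProvider are
hypothetical stand-ins (for MorphAnalyzer, the analyzer and the provider), not
classes from the plugin or from ES, and the sketch assumes the shared
lemmatizer object is safe to use from several threads:

```java
import java.util.concurrent.atomic.AtomicInteger;

final class ProviderSketch {

    /** Hypothetical stand-in for the expensive lemmatizer (MorphAnalyzer in this thread). */
    static final class SharedLemmatizer {
        /** Counts constructions so the "set up once" behaviour can be checked. */
        static final AtomicInteger LOADS = new AtomicInteger();

        private final String confFile;

        SharedLemmatizer(String confFile) {
            this.confFile = confFile; // a real loader would read this file here
            LOADS.incrementAndGet();
        }

        String confFile() {
            return confFile;
        }
    }

    /** Hypothetical stand-in for the analyzer: cheap to create, no static state. */
    static final class SketchAnalyzer {
        private final SharedLemmatizer lemmatizer;
        private final boolean analyzeBest;

        SketchAnalyzer(SharedLemmatizer lemmatizer, boolean analyzeBest) {
            this.lemmatizer = lemmatizer;
            this.analyzeBest = analyzeBest;
        }

        SharedLemmatizer lemmatizer() {
            return lemmatizer;
        }

        boolean analyzeBest() {
            return analyzeBest;
        }
    }

    /** Stand-in for the provider: does the setup once, creates analyzers on demand. */
    static final class SketchAnalyzerProvider {
        private final SharedLemmatizer lemmatizer;
        private final boolean analyzeBest;

        SketchAnalyzerProvider(String confFile, boolean analyzeBest) {
            this.lemmatizer = new SharedLemmatizer(confFile); // expensive part, done once
            this.analyzeBest = analyzeBest;
        }

        SketchAnalyzer get() {
            // A new analyzer per call; only the lemmatizer is shared.
            return new SketchAnalyzer(lemmatizer, analyzeBest);
        }
    }

    public static void main(String[] args) {
        SketchAnalyzerProvider provider =
                new SketchAnalyzerProvider("lemmatizer.properties", true);
        SketchAnalyzer first = provider.get();
        SketchAnalyzer second = provider.get();
        System.out.println("distinct analyzers: " + (first != second));
        System.out.println("shared lemmatizer: " + (first.lemmatizer() == second.lemmatizer()));
        System.out.println("lemmatizer loads: " + SharedLemmatizer.LOADS.get());
    }
}
```

With this shape the analyzer itself needs no static field or synchronized
loader; how many providers get created is still decided by the container.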

Jörg

On Wed, Mar 18, 2015 at 8:12 PM, Dmitry Kan <[email protected]> wrote:

> Yes, I use an analyzer provider. Here is the code:
>
> [code]
>
> public class SemanticAnalyzerTwitterLemmatizerProvider
>         extends AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {
>     private final RussianLemmatizingTwitterAnalyzer russianLemmatizingGenericAnalyzer;
>
>     private final Logger Log =
>             Logger.getLogger(SemanticAnalyzerTwitterLemmatizerProvider.class.getName());
>
>     @Inject
>     public SemanticAnalyzerTwitterLemmatizerProvider(Index index,
>                                                      @IndexSettings Settings indexSettings,
>                                                      @Assisted String name,
>                                                      Settings settings) {
>         super(index, indexSettings, name, settings);
>         Log.info("called super with name=" + name);
>         try {
>             String lemmatizerConfFile = settings.get("lemmatizerConf");
>             boolean analyzeBest = Boolean.parseBoolean(settings.get("analyzeBest"));
>             russianLemmatizingGenericAnalyzer =
>                     new RussianLemmatizingTwitterAnalyzer(lemmatizerConfFile, analyzeBest);
>         } catch (IOException ioe) {
>             throw new ElasticsearchIllegalArgumentException(
>                     "Unable to load Russian morphology analyzer", ioe);
>         } catch (Exception e) {
>             throw new ElasticsearchIllegalArgumentException(
>                     "Unable to load Russian morphology analyzer", e);
>         }
>     }
>
>     @Override
>     public RussianLemmatizingTwitterAnalyzer get() {
>         return russianLemmatizingGenericAnalyzer;
>     }
> }
>
>
> [/code]
>
> Would you recommend using your approach instead of this one? Do you spot
> any issues in my implementation of the provider?
>
> On Wednesday, 18 March 2015 20:52:08 UTC+2, Jörg Prante wrote:
>>
>> Do you use an analyzer provider?
>>
>> Example
>>
>> public class RussianLemmatizingTwitterAnalyzerProvider extends
>> AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {
>>
>>     private final MorphAnalyzer morphAnalyzer;
>>
>>     ...
>>
>>     @Inject
>>     public RussianLemmatizingTwitterAnalyzerProvider(Index index,
>>                                            @IndexSettings Settings indexSettings,
>>                                            Environment environment,
>>                                            @Assisted String name,
>>                                            @Assisted Settings settings) {
>>         super(index, indexSettings, name, settings);
>>         this.morphAnalyzer = createMorphAnalyzer(environment, settings, ...);
>>     }
>>
>>     @Override
>>     public RussianLemmatizingTwitterAnalyzer get() {
>>         return new RussianLemmatizingTwitterAnalyzer(morphAnalyzer, ...);
>>     }
>>
>>     private MorphAnalyzer createMorphAnalyzer(...) {
>>     }
>>
>> }
>>
>>
>> Only the provider is bound as a singleton. So the analyzer provider can
>> set up the analyzer configuration exactly once (with a MorphAnalyzer
>> instance etc.), and its get() method creates analyzers as required.
>>
>> Jörg
>>
>> On Wed, Mar 18, 2015 at 6:40 PM, Dmitry Kan <[email protected]> wrote:
>>
>>>
>>> Jörg,
>>>
>>> Thanks for replying!
>>>
>>> Here is the code of the RussianLemmatizingTwitterAnalyzer, the deepest
>>> class in the simplified class sequence I have posted in the original
>>> message.
>>>
>>> [code]
>>>
>>> public class RussianLemmatizingTwitterAnalyzer extends Analyzer {
>>>
>>>     private static MorphAnalyzer morphAnalyzerGlobal;
>>>
>>>     boolean useSyncMethod = true;
>>>
>>>     private static final boolean verbose = false;
>>>     private MorphAnalyzer morphAnalyzer;
>>>     private boolean analyzeBest = false;
>>>
>>>     private static final Logger Log =
>>>             Logger.getLogger(RussianLemmatizingTwitterAnalyzer.class.getName());
>>>
>>>     public RussianLemmatizingTwitterAnalyzer(String lemmatizerConfFile, boolean analyzeBest)
>>>             throws IOException {
>>>         this.analyzeBest = analyzeBest;
>>>
>>>         if (useSyncMethod) {
>>>             this.morphAnalyzer = loadCustomAnalyzer(lemmatizerConfFile);
>>>         } else {
>>>             Properties properties = new Properties();
>>>
>>>             Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>>>
>>>             properties.load(new StringReader(
>>>                     IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
>>>             this.morphAnalyzer = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
>>>         }
>>>     }
>>>
>>>     private static MorphAnalyzer loadAnalyzer(String lemmatizerConfFile) throws IOException {
>>>         Properties properties = new Properties();
>>>
>>>         Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>>>
>>>         properties.load(new StringReader(
>>>                 IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
>>>         MorphAnalyzer morphAnalyzer1 =
>>>                 MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
>>>
>>>         if (verbose) {
>>>             if (morphAnalyzer1 != null) {
>>>                 Log.info("Successfully created the analyzer!");
>>>                 Log.info(morphAnalyzer1.analyzeBest("билета").toString());
>>>             } else {
>>>                 Log.severe("Failed to create the morphAnalyzer object");
>>>             }
>>>         }
>>>
>>>         return morphAnalyzer1;
>>>     }
>>>
>>>     public static synchronized MorphAnalyzer loadCustomAnalyzer(String lemmatizerConfFile)
>>>             throws IOException {
>>>         if (morphAnalyzerGlobal == null) {
>>>             morphAnalyzerGlobal = loadAnalyzer(lemmatizerConfFile);
>>>         }
>>>
>>>         return morphAnalyzerGlobal;
>>>     }
>>>
>>>     @Override
>>>     protected TokenStreamComponents createComponents(String fieldName, final Reader reader) {
>>>         Tokenizer tokenizer = new TwitterFlexLuceneTokenizer(reader);
>>>
>>>         Log.config("Using Tokenizer: " + tokenizer.getClass().getSimpleName());
>>>
>>>         TokenStream tokenStream = tokenizer;
>>>         tokenStream = new LowerCaseFilter(Version.LUCENE_4_9, tokenStream);
>>>         tokenStream = new MorphAnalTokenFilter(tokenStream, morphAnalyzer, analyzeBest);
>>>         return new TokenStreamComponents(tokenizer, tokenStream);
>>>     }
>>>
>>> }
>>>
>>>
>>> [/code]
>>>
>>> Note that in the code above the TwitterFlexLuceneTokenizer is not
>>> thread-safe and extends o.a.lucene.analysis.Tokenizer. jvisualvm shows
>>> 97 instances of this class.
>>>
>>> Let me know if I should post other code snippets from further up the class chain.
>>>
>>> Dmitry
>>>
>>> On Wednesday, 18 March 2015 18:22:47 UTC+2, Jörg Prante wrote:
>>>>
>>>> Is it possible to examine the code of your plugin?
>>>>
>>>> Generally speaking, analyzers are instantiated per index creation for
>>>> each thread.
>>>>
>>>> In org.elasticsearch.index.analysis.AnalysisModule, you can see how
>>>> analyzer providers and factories are prepared for injection with the help
>>>> of the ES injection module, which is based on Guice. Basically, the
>>>> factories are kept as singletons, and each thread can pick analyzer
>>>> instances from the factory when needed. All in all, Lucene analyzer classes
>>>> are not thread-safe, in particular the tokenizers. This means it is up to
>>>> the implementor of an analyzer/tokenizer to store immutable objects as
>>>> singletons in a correct way so that all threads can safely access them.
>>>>
>>>> Jörg
>>>>
>>>> On Wed, Mar 18, 2015 at 4:02 PM, Dmitry Kan <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Could somebody answer, please?
>>>>>
>>>>>
>>>>> On Tuesday, 17 March 2015 19:05:38 UTC+2, Dmitry Kan wrote:
>>>>>>
>>>>>> Hello!
>>>>>>
>>>>>> I'm a newbie in elasticsearch, so forgive me if the question is lame.
>>>>>>
>>>>>> I have implemented a custom plugin using a custom lemmatizer and a
>>>>>> tokenizer. The simplified class sequence:
>>>>>>
>>>>>>
>>>>>> AnalysisMorphologyPlugin->MorphologyAnalysisBinderProcessor->SemanticAnalyzerTwitterLemmatizerProvider->RussianLemmatizingTwitterAnalyzer
>>>>>>
>>>>>> In the RussianLemmatizingTwitterAnalyzer's ctor I load the custom object
>>>>>> for lemmatization (object unrelated to lucene/es) in a singleton fashion
>>>>>> (in a synchronized code block).
>>>>>> Then, when creating 14 indices in the same JVM I see
>>>>>>  14 instances of RussianLemmatizingTwitterAnalyzer,
>>>>>>  4 instances of SemanticAnalyzerTwitterLemmatizerProvider,
>>>>>>  4 instances of MorphologyAnalysisBinderProcessor,
>>>>>>  30 instances of the custom lemmatizer (in each 
>>>>>> RussianLemmatizingTwitterAnalyzer only one instance is expected, so 
>>>>>> should be 14),
>>>>>>  1 instance of AnalysisMorphologyPlugin.
>>>>>>
>>>>>> The question is, can the RussianLemmatizingTwitterAnalyzer object be
>>>>>> shared between indices? Or is it by design that it must be loaded
>>>>>> separately per index?
>>>>>> What could be wrong in the code that creates 30 instances of the custom
>>>>>> singleton lemmatizer instead of 14?
>>>>>>
>>>>>> Currently, *with* the plugin the JVM reserves 100M of RAM with no data.
>>>>>> *Without* the plugin the JVM reserves 2M with no data.
>>>>>> Elasticsearch 1.3.2, Lucene 4.9.0.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Dmitry Kan
>>>>>>
>>>>>>  --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "elasticsearch" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/elasticsearch/c2c57184-ee3b-4600-9091-a515b496b867%40googlegroups.com.
>>>>>
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>

