Re: issue with singleton analyzer in single JVM multi-index setup

Dmitry Kan Wed, 18 Mar 2015 10:42:11 -0700

Jörg,

Thanks for replying!


Here is the code of RussianLemmatizingTwitterAnalyzer, the innermost 
class in the simplified class sequence I posted in my original 
message.

[code]

public class RussianLemmatizingTwitterAnalyzer extends Analyzer {

    // Lazily loaded instance shared by all analyzers (static, so one per class loader).
    private static MorphAnalyzer morphAnalyzerGlobal;

    boolean useSyncMethod = true;

    private static final boolean verbose = false;
    private MorphAnalyzer morphAnalyzer;
    private boolean analyzeBest = false;

    private static final Logger Log =
            Logger.getLogger(RussianLemmatizingTwitterAnalyzer.class.getName());

    public RussianLemmatizingTwitterAnalyzer(String lemmatizerConfFile, boolean analyzeBest)
            throws IOException {
        this.analyzeBest = analyzeBest;

        if (useSyncMethod) {
            this.morphAnalyzer = loadCustomAnalyzer(lemmatizerConfFile);
        } else {
            Properties properties = new Properties();

            Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);

            properties.load(new StringReader(
                    IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
            this.morphAnalyzer = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
        }
    }

    private static MorphAnalyzer loadAnalyzer(String lemmatizerConfFile) throws IOException {
        Properties properties = new Properties();

        Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);

        properties.load(new StringReader(
                IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
        MorphAnalyzer morphAnalyzer1 = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));

        if (verbose) {
            if (morphAnalyzer1 != null) {
                Log.info("Successfully created the analyzer!");
                Log.info(morphAnalyzer1.analyzeBest("билета").toString());
            } else {
                Log.severe("Failed to create the morphAnalyzer object");
            }
        }

        return morphAnalyzer1;
    }

    public static synchronized MorphAnalyzer loadCustomAnalyzer(String lemmatizerConfFile)
            throws IOException {
        if (morphAnalyzerGlobal == null) {
            morphAnalyzerGlobal = loadAnalyzer(lemmatizerConfFile);
        }

        return morphAnalyzerGlobal;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, final Reader reader) {
        Tokenizer tokenizer = new TwitterFlexLuceneTokenizer(reader);

        Log.config("Using Tokenizer: " + tokenizer.getClass().getSimpleName());

        TokenStream tokenStream = tokenizer;
        tokenStream = new LowerCaseFilter(Version.LUCENE_4_9, tokenStream);
        tokenStream = new MorphAnalTokenFilter(tokenStream, morphAnalyzer, analyzeBest);
        return new TokenStreamComponents(tokenizer, tokenStream);
    }

}


[/code] 
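
One thing the code above does not handle is a second configuration file: loadCustomAnalyzer returns whatever was loaded first, whichever path is passed in. A minimal sketch of a cache keyed by the configuration path follows. It assumes Java 8 (for computeIfAbsent); Lemmatizer is a hypothetical stand-in for MorphAnalyzer, and SharedLemmatizerCache is not part of the actual plugin.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedLemmatizerCache {

    /** Hypothetical stand-in for the expensive, immutable lemmatizer object. */
    public static final class Lemmatizer {
        private final String confFile;

        Lemmatizer(String confFile) {
            this.confFile = confFile;
        }

        public String confFile() {
            return confFile;
        }
    }

    // One entry per configuration path; ConcurrentHashMap.computeIfAbsent
    // applies the loader at most once per key, even under contention.
    private static final ConcurrentHashMap<String, Lemmatizer> CACHE = new ConcurrentHashMap<>();

    // Counts real loads, so the sharing can be checked.
    private static final AtomicInteger LOADS = new AtomicInteger();

    /** Returns the shared instance for this configuration path, loading it on first use. */
    public static Lemmatizer get(String confFile) {
        return CACHE.computeIfAbsent(confFile, path -> {
            LOADS.incrementAndGet();
            return new Lemmatizer(path); // assumption: the real loading would happen here
        });
    }

    public static int loadCount() {
        return LOADS.get();
    }

    public static void main(String[] args) {
        Lemmatizer first = get("lemmatizer.properties");
        Lemmatizer second = get("lemmatizer.properties");
        System.out.println("same instance: " + (first == second));
        System.out.println("loads so far: " + loadCount());
    }
}
```

A static cache like this is still per class loader: if the plugin classes are loaded more than once, each copy of the class gets its own cache.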

Note that the TwitterFlexLuceneTokenizer used in the code above is not 
thread-safe and extends o.a.lucene.analysis.Tokenizer. jvisualvm shows 97 
instances of this class.
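
For illustration, here is a minimal sketch of the general pattern for combining a shared immutable object with a component that is not thread-safe: the immutable object is created once, and each thread gets its own instance of the stateful component. All class names are hypothetical stand-ins, not the plugin's or Lucene's classes. Lucene's Analyzer, as far as I understand it, keeps its TokenStreamComponents per thread in a similar way, which would make the tokenizer count grow with the number of threads rather than the number of indices; I have not verified that this explains the 97 instances.

```java
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;

public class PerThreadComponents {

    /** Hypothetical immutable, thread-safe resource (stands in for the lemmatizer). */
    public static final class SharedDictionary {
        public String lemma(String token) {
            return token.toLowerCase(Locale.ROOT);
        }
    }

    /** Hypothetical stateful, non-thread-safe component (stands in for a tokenizer). */
    public static final class StatefulTokenizer {
        static final AtomicInteger INSTANCES = new AtomicInteger();

        private final SharedDictionary dictionary;
        private final StringBuilder buffer = new StringBuilder(); // mutable state, one thread only

        StatefulTokenizer(SharedDictionary dictionary) {
            this.dictionary = dictionary;
            INSTANCES.incrementAndGet();
        }

        String process(String text) {
            buffer.setLength(0);
            buffer.append(dictionary.lemma(text));
            return buffer.toString();
        }
    }

    // Created once and shared by every thread.
    private static final SharedDictionary DICTIONARY = new SharedDictionary();

    // One tokenizer per thread, created lazily on that thread's first call.
    private static final ThreadLocal<StatefulTokenizer> TOKENIZER =
            ThreadLocal.withInitial(() -> new StatefulTokenizer(DICTIONARY));

    public static String analyze(String text) {
        return TOKENIZER.get().process(text);
    }

    /** Runs analyze() twice on each of n fresh threads; returns how many tokenizers were created. */
    public static int newTokenizersFor(int n) {
        int before = StatefulTokenizer.INSTANCES.get();
        Thread[] threads = new Thread[n];
        for (int i = 0; i < n; i++) {
            threads[i] = new Thread(() -> {
                analyze("Tickets");
                analyze("Ticket");
            });
            threads[i].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return StatefulTokenizer.INSTANCES.get() - before;
    }

    public static void main(String[] args) {
        System.out.println("tokenizers created for 4 threads: " + newTokenizersFor(4));
    }
}
```

Each fresh thread creates exactly one tokenizer no matter how many times it calls analyze(), while the dictionary is never duplicated.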

Let me know if I should post code from the other classes further up the 
class sequence.

Dmitry

On Wednesday, 18 March 2015 18:22:47 UTC+2, Jörg Prante wrote:
>
> Is it possible to examine the code of your plugin?
>
> Generally speaking, analyzers are instantiated per index creation for each 
> thread.
>
> In org.elasticsearch.index.analysis.AnalysisModule, you can see how 
> analyzer providers and factories are prepared for injection with the help 
> of the ES injection module, which is based on Guice. Basically, the 
> factories are kept as singletons, and each thread can pick analyzer 
> instances from the factory when needed. All in all, Lucene analyzer classes 
> are not thread-safe, in particular the tokenizers. This means it is up to 
> the implementor of an analyzer/tokenizer to store immutable objects as 
> singletons in a correct way so that all threads can safely access them.
>
> Jörg
>
> On Wed, Mar 18, 2015 at 4:02 PM, Dmitry Kan <[email protected]> wrote:
>
>> Hi,
>>
>> Could somebody answer, please?
>>
>>
>> On Tuesday, 17 March 2015 19:05:38 UTC+2, Dmitry Kan wrote:
>>>
>>> Hello!
>>>
>>> I'm a newbie in Elasticsearch, so forgive me if the question is lame.
>>>
>>> I have implemented a custom plugin using a custom lemmatizer and a 
>>> tokenizer. The simplified class sequence: 
>>>
>>>
>>> AnalysisMorphologyPlugin->MorphologyAnalysisBinderProcessor->SemanticAnalyzerTwitterLemmatizerProvider->RussianLemmatizingTwitterAnalyzer
>>>
>>> In the RussianLemmatizingTwitterAnalyzer's constructor I load the custom 
>>> object for lemmatization (an object unrelated to Lucene/ES) in a singleton 
>>> fashion (in a synchronized code block).
>>> Then, when creating 14 indices in the same JVM I see 
>>>  14 instances of RussianLemmatizingTwitterAnalyzer, 
>>>  4 instances of SemanticAnalyzerTwitterLemmatizerProvider, 
>>>  4 instances of MorphologyAnalysisBinderProcessor,
>>>  30 instances of the custom lemmatizer (in each 
>>> RussianLemmatizingTwitterAnalyzer only one instance is expected, so should 
>>> be 14), 
>>>  1 instance of AnalysisMorphologyPlugin.
>>>
>>> The question is: can the RussianLemmatizingTwitterAnalyzer object be 
>>> shared between indices? Or is it by design that it must be loaded 
>>> separately per index?
>>> What could be wrong in the code that produces 30 instances of the custom 
>>> singleton lemmatizer instead of 14?
>>>
>>> Currently, *with* the plugin the JVM reserves 100M of RAM with no data 
>>> loaded; *without* the plugin it reserves 2M with no data. 
>>> Elasticsearch 1.3.2, Lucene 4.9.0.
>>>
>>> Regards,
>>>
>>> Dmitry Kan
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/decd4cb8-9a41-4f5b-a8a6-ce629757ed88%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
