Do you use an analyzer provider?

Example

public class RussianLemmatizingTwitterAnalyzerProvider
        extends AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {

    private final MorphAnalyzer morphAnalyzer;

    ...

    @Inject
    public RussianLemmatizingTwitterAnalyzerProvider(Index index,
                                                     @IndexSettings Settings indexSettings,
                                                     Environment environment,
                                                     @Assisted String name,
                                                     @Assisted Settings settings) {
        super(index, indexSettings, name, settings);
        this.morphAnalyzer = createMorphAnalyzer(environment, settings, ...);
    }

    @Override
    public RussianLemmatizingTwitterAnalyzer get() {
        return new RussianLemmatizingTwitterAnalyzer(morphAnalyzer, ...);
    }

    private MorphAnalyzer createMorphAnalyzer(...) {
    }

}


Only such a provider is bound as a singleton. The provider can therefore
set up the analyzer configuration exactly once (with a MorphAnalyzer
instance, etc.), and its get() method creates analyzers as required.
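
As a rough, self-contained illustration of that split (plain Java, no
Elasticsearch or Lucene on the classpath): HeavyLemmatizer, LightAnalyzer
and AnalyzerProvider below are hypothetical stand-ins for MorphAnalyzer,
the analyzer and the provider. The sketch assumes the expensive object is
immutable and therefore safe to share. In a real plugin, how many provider
instances exist depends on how the injector binds them.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: none of these types are Elasticsearch or Lucene classes.
final class ProviderSketch {

    /** Hypothetical stand-in for the expensive, immutable MorphAnalyzer. */
    static final class HeavyLemmatizer {
        static final AtomicInteger LOADS = new AtomicInteger();
        final String confFile;

        HeavyLemmatizer(String confFile) {
            this.confFile = confFile;
            LOADS.incrementAndGet(); // count how often the costly load runs
        }
    }

    /** Hypothetical stand-in for the cheap analyzer that is created on demand. */
    static final class LightAnalyzer {
        final HeavyLemmatizer lemmatizer;

        LightAnalyzer(HeavyLemmatizer lemmatizer) {
            this.lemmatizer = lemmatizer;
        }
    }

    /** Stand-in for the provider: loads the resource once, hands out analyzers. */
    static final class AnalyzerProvider {
        private final HeavyLemmatizer lemmatizer;

        AnalyzerProvider(String confFile) {
            // The expensive load happens once, when the provider is constructed.
            this.lemmatizer = new HeavyLemmatizer(confFile);
        }

        LightAnalyzer get() {
            // Every call returns a new analyzer that reuses the shared lemmatizer.
            return new LightAnalyzer(lemmatizer);
        }
    }

    public static void main(String[] args) {
        AnalyzerProvider provider = new AnalyzerProvider("lemmatizer.properties");
        LightAnalyzer a = provider.get();
        LightAnalyzer b = provider.get();
        System.out.println("distinct analyzers: " + (a != b));
        System.out.println("shared lemmatizer: " + (a.lemmatizer == b.lemmatizer));
        System.out.println("loads: " + HeavyLemmatizer.LOADS.get());
    }
}
```

The point of the sketch is that the analyzer constructor takes the already
loaded object instead of a configuration file path, so creating another
analyzer never triggers another load.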

Jörg

On Wed, Mar 18, 2015 at 6:40 PM, Dmitry Kan <[email protected]> wrote:

>
> Jörg,
>
> Thanks for replying!
>
> Here is the code of the RussianLemmatizingTwitterAnalyzer, the deepest
> class in the simplified class sequence I have posted in the original
> message.
>
> [code]
>
> public class RussianLemmatizingTwitterAnalyzer extends Analyzer {
>
>     private static MorphAnalyzer morphAnalyzerGlobal;
>
>     boolean useSyncMethod = true;
>
>     private static final boolean verbose = false;
>     private MorphAnalyzer morphAnalyzer;
>     private boolean analyzeBest = false;
>
>     private static final Logger Log =
>             Logger.getLogger(RussianLemmatizingTwitterAnalyzer.class.getName());
>
>     public RussianLemmatizingTwitterAnalyzer(String lemmatizerConfFile,
>                                              boolean analyzeBest) throws IOException {
>         this.analyzeBest = analyzeBest;
>
>         if (useSyncMethod) {
>             this.morphAnalyzer = loadCustomAnalyzer(lemmatizerConfFile);
>         } else {
>             Properties properties = new Properties();
>
>             Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>
>             properties.load(new StringReader(
>                     IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
>             this.morphAnalyzer = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
>         }
>     }
>
>     private static MorphAnalyzer loadAnalyzer(String lemmatizerConfFile) throws IOException {
>         Properties properties = new Properties();
>
>         Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>
>         properties.load(new StringReader(
>                 IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
>         MorphAnalyzer morphAnalyzer1 = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
>
>         if (verbose) {
>             if (morphAnalyzer1 != null) {
>                 Log.info("Successfully created the analyzer!");
>                 Log.info(morphAnalyzer1.analyzeBest("билета").toString());
>             } else {
>                 Log.severe("Failed to create the morphAnalyzer object");
>             }
>         }
>
>         return morphAnalyzer1;
>     }
>
>     public static synchronized MorphAnalyzer loadCustomAnalyzer(String lemmatizerConfFile)
>             throws IOException {
>         if (morphAnalyzerGlobal == null) {
>             morphAnalyzerGlobal = loadAnalyzer(lemmatizerConfFile);
>         }
>
>         return morphAnalyzerGlobal;
>     }
>
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName, final Reader reader) {
>         Tokenizer tokenizer = new TwitterFlexLuceneTokenizer(reader);
>
>         Log.config("Using Tokenizer: " + tokenizer.getClass().getSimpleName());
>
>         TokenStream tokenStream = tokenizer;
>         tokenStream = new LowerCaseFilter(Version.LUCENE_4_9, tokenStream);
>         tokenStream = new MorphAnalTokenFilter(tokenStream, morphAnalyzer, analyzeBest);
>         return new TokenStreamComponents(tokenizer, tokenStream);
>     }
>
> }
>
>
> [/code]
>
> Note that in the code above the TwitterFlexLuceneTokenizer is not
> thread-safe and extends o.a.lucene.analysis.Tokenizer. In jvisualvm there
> are 97 instances of this class.
>
> Let me know if I should post other code snippets further up the class sequence.
>
> Dmitry
>
> On Wednesday, 18 March 2015 18:22:47 UTC+2, Jörg Prante wrote:
>>
>> Is it possible to examine the code of your plugin?
>>
>> Generally speaking, analyzers are instantiated per index creation for
>> each thread.
>>
>> In org.elasticsearch.index.analysis.AnalysisModule, you can see how
>> analyzer providers and factories are prepared for injection with the help
>> of the ES injection module, which is based on Guice. Basically, the
>> factories are kept as singletons, and each thread can pick analyzer
>> instances from the factory when needed. All in all, Lucene analyzer classes
>> are not thread-safe, in particular the tokenizers. This means it is up to
>> the implementor of an analyzer/tokenizer to store immutable objects as
>> singletons in a correct way so that all threads can safely access them.
>>
>> Jörg
>>
>> On Wed, Mar 18, 2015 at 4:02 PM, Dmitry Kan <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Could somebody answer, please?
>>>
>>>
>>> On Tuesday, 17 March 2015 19:05:38 UTC+2, Dmitry Kan wrote:
>>>>
>>>> Hello!
>>>>
>>>> I'm a newbie to Elasticsearch, so forgive me if the question is lame.
>>>>
>>>> I have implemented a custom plugin using a custom lemmatizer and a
>>>> tokenizer. The simplified class sequence:
>>>>
>>>>
>>>> AnalysisMorphologyPlugin->MorphologyAnalysisBinderProcessor->SemanticAnalyzerTwitterLemmatizerProvider->RussianLemmatizingTwitterAnalyzer
>>>>
>>>> In the RussianLemmatizingTwitterAnalyzer's ctor I load the custom object
>>>> for lemmatization (an object unrelated to Lucene/ES) in a singleton fashion
>>>> (in a synchronized code block).
>>>> Then, when creating 14 indices in the same JVM I see
>>>>  14 instances of RussianLemmatizingTwitterAnalyzer,
>>>>  4 instances of SemanticAnalyzerTwitterLemmatizerProvider,
>>>>  4 instances of MorphologyAnalysisBinderProcessor,
>>>>  30 instances of the custom lemmatizer (in each 
>>>> RussianLemmatizingTwitterAnalyzer only one instance is expected, so should 
>>>> be 14),
>>>>  1 instance of AnalysisMorphologyPlugin.
>>>>
>>>> The question is: can the RussianLemmatizingTwitterAnalyzer object be
>>>> shared between indices? Or is it by design that analyzers must be loaded
>>>> separately per index?
>>>> What could be wrong in the code that creates 30 instances of the custom
>>>> singleton lemmatizer instead of 14?
>>>>
>>>> Currently, *with* the plugin the JVM reserves 100M of RAM with no data.
>>>> *Without* the plugin the JVM reserves 2M with no data.
>>>> Elasticsearch 1.3.2, Lucene 4.9.0.
>>>>
>>>> Regards,
>>>>
>>>> Dmitry Kan
>>>>
>>>>  --
>>> You received this message because you are subscribed to the Google
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/elasticsearch/c2c57184-ee3b-4600-9091-a515b496b867%40googlegroups.com.
>>>
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
