Do you use an analyzer provider? For example:

public class RussianLemmatizingTwitterAnalyzerProvider
        extends AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {

    private final MorphAnalyzer morphAnalyzer;
    ...

    @Inject
    public RussianLemmatizingTwitterAnalyzerProvider(Index index,
            @IndexSettings Settings indexSettings,
            Environment environment,
            @Assisted String name,
            @Assisted Settings settings) {
        super(index, indexSettings, name, settings);
        this.morphAnalyzer = createMorphAnalyzer(environment, settings, ...);
    }

    @Override
    public RussianLemmatizingTwitterAnalyzer get() {
        return new RussianLemmatizingTwitterAnalyzer(morphAnalyzer, ...);
    }

    private MorphAnalyzer createMorphAnalyzer(...) {
    }
}

Only the provider is bound as a singleton. That lets the provider set up the
analyzer configuration exactly once (the MorphAnalyzer instance, etc.), and
its get() method then creates analyzers as required.
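
To make the pattern concrete without pulling in Elasticsearch or Lucene, here is a minimal, self-contained sketch. All class names in it (ProviderSketch, Lemmatizer, LemmatizingAnalyzer, LemmatizingAnalyzerProvider) are hypothetical stand-ins and not part of any real API. It only illustrates the idea described above: the expensive object is built once in the provider's constructor, and get() hands out cheap analyzer instances that share it.

```java
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins only; none of these are Elasticsearch or Lucene classes.
final class ProviderSketch {

    /** Stand-in for the expensive, shareable resource (e.g. a MorphAnalyzer). */
    static final class Lemmatizer {
        static final AtomicInteger LOADS = new AtomicInteger();

        Lemmatizer(String confFile) {
            LOADS.incrementAndGet(); // counts how often the expensive load runs
        }

        String lemma(String token) {
            return token.toLowerCase(Locale.ROOT); // placeholder, not real lemmatization
        }
    }

    /** Stand-in for the cheap analyzer that wraps the shared resource. */
    static final class LemmatizingAnalyzer {
        private final Lemmatizer lemmatizer;

        LemmatizingAnalyzer(Lemmatizer lemmatizer) {
            this.lemmatizer = lemmatizer;
        }

        String analyze(String token) {
            return lemmatizer.lemma(token);
        }
    }

    /** Stand-in for the provider: loads once, then creates analyzers in get(). */
    static final class LemmatizingAnalyzerProvider {
        private final Lemmatizer lemmatizer;

        LemmatizingAnalyzerProvider(String confFile) {
            this.lemmatizer = new Lemmatizer(confFile); // expensive setup, done once
        }

        LemmatizingAnalyzer get() {
            return new LemmatizingAnalyzer(lemmatizer); // cheap, shares the lemmatizer
        }
    }

    public static void main(String[] args) {
        LemmatizingAnalyzerProvider provider =
                new LemmatizingAnalyzerProvider("lemmatizer.properties");
        LemmatizingAnalyzer a = provider.get();
        LemmatizingAnalyzer b = provider.get();
        System.out.println("distinct analyzers: " + (a != b));
        System.out.println("lemmatizer loads: " + Lemmatizer.LOADS.get());
    }
}
```

In this sketch the number of Lemmatizer instances follows the number of provider instances, not the number of analyzers created. Whether the same holds inside a real plugin depends on how many provider instances the container creates.
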
Jörg
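
One detail in the quoted loadCustomAnalyzer method below: it keeps a single static MorphAnalyzer and returns it regardless of which lemmatizerConfFile is passed on later calls. If different indices could ever use different configuration files, a cache keyed by the file path is one possible alternative. The following is only a sketch under that assumption; SharedResourceCache and its loader function are hypothetical names, and a loader that throws IOException would have to wrap it (for example in UncheckedIOException).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical helper: caches one shared resource per configuration path.
final class SharedResourceCache<V> {

    private final Map<String, V> cache = new ConcurrentHashMap<>();
    private final Function<String, V> loader;

    SharedResourceCache(Function<String, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached resource for confFile, loading it on first use. */
    V get(String confFile) {
        // ConcurrentHashMap.computeIfAbsent is atomic: the loader is applied
        // at most once per key, even when several threads ask concurrently.
        return cache.computeIfAbsent(confFile, loader);
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        SharedResourceCache<String> cache = new SharedResourceCache<>(path -> {
            loads.incrementAndGet(); // stands in for the expensive load
            return "lemmatizer loaded from " + path;
        });

        String first = cache.get("ru.properties");
        String second = cache.get("ru.properties");
        cache.get("other.properties");

        System.out.println("same instance: " + (first == second));
        System.out.println("loads: " + loads.get());
    }
}
```

The cached value must itself be safe to share between threads, which matches the advice quoted further down about storing immutable objects as singletons.
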
On Wed, Mar 18, 2015 at 6:40 PM, Dmitry Kan <[email protected]> wrote:
>
> Jörg,
>
> Thanks for replying!
>
> Here is the code of the RussianLemmatizingTwitterAnalyzer, the deepest
> class in the simplified class sequence I have posted in the original
> message.
>
> [code]
>
> public class RussianLemmatizingTwitterAnalyzer extends Analyzer {
>
>     private static MorphAnalyzer morphAnalyzerGlobal;
>
>     boolean useSyncMethod = true;
>
>     private static final boolean verbose = false;
>     private MorphAnalyzer morphAnalyzer;
>     private boolean analyzeBest = false;
>
>     private static final Logger Log =
>             Logger.getLogger(RussianLemmatizingTwitterAnalyzer.class.getName());
>
>     public RussianLemmatizingTwitterAnalyzer(String lemmatizerConfFile,
>             boolean analyzeBest) throws IOException {
>         this.analyzeBest = analyzeBest;
>
>         if (useSyncMethod) {
>             this.morphAnalyzer = loadCustomAnalyzer(lemmatizerConfFile);
>         } else {
>             Properties properties = new Properties();
>
>             Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>
>             properties.load(new StringReader(
>                     IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
>             this.morphAnalyzer = MorphAnalyzerLoader.load(
>                     new MorphAnalyzerConfig(properties));
>         }
>     }
>
>     private static MorphAnalyzer loadAnalyzer(String lemmatizerConfFile)
>             throws IOException {
>         Properties properties = new Properties();
>
>         Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>
>         properties.load(new StringReader(
>                 IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
>         MorphAnalyzer morphAnalyzer1 = MorphAnalyzerLoader.load(
>                 new MorphAnalyzerConfig(properties));
>
>         if (verbose) {
>             if (morphAnalyzer1 != null) {
>                 Log.info("Successfully created the analyzer!");
>                 Log.info(morphAnalyzer1.analyzeBest("билета").toString());
>             } else {
>                 Log.severe("Failed to create the morphAnalyzer object");
>             }
>         }
>
>         return morphAnalyzer1;
>     }
>
>     public static synchronized MorphAnalyzer loadCustomAnalyzer(
>             String lemmatizerConfFile) throws IOException {
>         if (morphAnalyzerGlobal == null) {
>             morphAnalyzerGlobal = loadAnalyzer(lemmatizerConfFile);
>         }
>
>         return morphAnalyzerGlobal;
>     }
>
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName,
>             final Reader reader) {
>         Tokenizer tokenizer = new TwitterFlexLuceneTokenizer(reader);
>
>         Log.config("Using Tokenizer: " + tokenizer.getClass().getSimpleName());
>
>         TokenStream tokenStream = tokenizer;
>         tokenStream = new LowerCaseFilter(Version.LUCENE_4_9, tokenStream);
>         tokenStream = new MorphAnalTokenFilter(tokenStream, morphAnalyzer,
>                 analyzeBest);
>         return new TokenStreamComponents(tokenizer, tokenStream);
>     }
>
> }
>
>
> [/code]
>
> Note that in the code above, TwitterFlexLuceneTokenizer is not thread-safe
> and extends o.a.lucene.analysis.Tokenizer. jvisualvm shows 97 instances of
> this class.
>
> Let me know if I should post more code snippets from further up the class
> chain.
>
> Dmitry
>
> On Wednesday, 18 March 2015 18:22:47 UTC+2, Jörg Prante wrote:
>>
>> Is it possible to examine the code of your plugin?
>>
>> Generally speaking, analyzers are instantiated per index creation, for
>> each thread.
>>
>> In org.elasticsearch.index.analysis.AnalysisModule, you can see how
>> analyzer providers and factories are prepared for injection with the help
>> of the ES injection module, which is based on Guice. Basically, the
>> factories are kept as singletons, and each thread can pick analyzer
>> instances from the factory when needed. Lucene analyzer classes, and the
>> tokenizers in particular, are not thread-safe. This means it is up to the
>> implementor of an analyzer/tokenizer to store immutable objects as
>> singletons correctly, so that all threads can safely access them.
>>
>> Jörg
>>
>> On Wed, Mar 18, 2015 at 4:02 PM, Dmitry Kan <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Could somebody answer, please?
>>>
>>>
>>> On Tuesday, 17 March 2015 19:05:38 UTC+2, Dmitry Kan wrote:
>>>>
>>>> Hello!
>>>>
>>>> I'm an Elasticsearch newbie, so forgive me if the question is lame.
>>>>
>>>> I have implemented a custom plugin using a custom lemmatizer and a
>>>> tokenizer. The simplified class sequence:
>>>>
>>>>
>>>> AnalysisMorphologyPlugin->MorphologyAnalysisBinderProcessor->SemanticAnalyzerTwitterLemmatizerProvider->RussianLemmatizingTwitterAnalyzer
>>>>
>>>> In the RussianLemmatizingTwitterAnalyzer's constructor I load the custom
>>>> object for lemmatization (unrelated to Lucene/ES) in a singleton fashion
>>>> (in a synchronized code block).
>>>> Then, when creating 14 indices in the same JVM, I see:
>>>> 14 instances of RussianLemmatizingTwitterAnalyzer,
>>>> 4 instances of SemanticAnalyzerTwitterLemmatizerProvider,
>>>> 4 instances of MorphologyAnalysisBinderProcessor,
>>>> 30 instances of the custom lemmatizer (only one instance is expected in
>>>> each RussianLemmatizingTwitterAnalyzer, so there should be 14),
>>>> 1 instance of AnalysisMorphologyPlugin.
>>>>
>>>> The question is: can a RussianLemmatizingTwitterAnalyzer object be
>>>> shared between indices? Or is it by design that they must be loaded
>>>> separately per index?
>>>> What could be wrong in the code that produces 30 instances of the custom
>>>> singleton lemmatizer instead of 14?
>>>>
>>>> Currently, *with* the plugin the JVM reserves 100M of RAM with no data;
>>>> *without* the plugin it reserves 2M with no data. Elasticsearch 1.3.2,
>>>> Lucene 4.9.0.
>>>>
>>>> Regards,
>>>>
>>>> Dmitry Kan
>>>>
>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/elasticsearch/c2c57184-ee3b-4600-9091-a515b496b867%40googlegroups.com.
>>>
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>