Jörg,
Thanks for replying!
Here is the code of RussianLemmatizingTwitterAnalyzer, the deepest
class in the simplified class sequence I posted in my original
message.
[code]
public class RussianLemmatizingTwitterAnalyzer extends Analyzer {

    // Shared by all analyzer instances loaded by the same class loader.
    private static MorphAnalyzer morphAnalyzerGlobal;
    boolean useSyncMethod = true;
    private static final boolean verbose = false;
    private MorphAnalyzer morphAnalyzer;
    private boolean analyzeBest = false;
    private static final Logger Log =
            Logger.getLogger(RussianLemmatizingTwitterAnalyzer.class.getName());

    public RussianLemmatizingTwitterAnalyzer(String lemmatizerConfFile, boolean analyzeBest) throws IOException {
        this.analyzeBest = analyzeBest;
        if (useSyncMethod) {
            // Shared path: reuse the statically cached MorphAnalyzer.
            this.morphAnalyzer = loadCustomAnalyzer(lemmatizerConfFile);
        } else {
            // Non-shared path: every analyzer instance loads its own MorphAnalyzer.
            Properties properties = new Properties();
            Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
            properties.load(new StringReader(IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
            this.morphAnalyzer = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
        }
    }

    private static MorphAnalyzer loadAnalyzer(String lemmatizerConfFile) throws IOException {
        Properties properties = new Properties();
        Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
        properties.load(new StringReader(IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
        MorphAnalyzer morphAnalyzer1 = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
        if (verbose) {
            if (morphAnalyzer1 != null) {
                Log.info("Successfully created the analyzer!");
                Log.info(morphAnalyzer1.analyzeBest("билета").toString());
            } else {
                Log.severe("Failed to create the morphAnalyzer object");
            }
        }
        return morphAnalyzer1;
    }

    // Lazily loads the shared MorphAnalyzer on first call; later calls return
    // the cached instance regardless of the configuration file passed in.
    public static synchronized MorphAnalyzer loadCustomAnalyzer(String lemmatizerConfFile)
            throws IOException {
        if (morphAnalyzerGlobal == null) {
            morphAnalyzerGlobal = loadAnalyzer(lemmatizerConfFile);
        }
        return morphAnalyzerGlobal;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, final Reader reader) {
        Tokenizer tokenizer = new TwitterFlexLuceneTokenizer(reader);
        Log.config("Using Tokenizer: " + tokenizer.getClass().getSimpleName());
        TokenStream tokenStream = tokenizer;
        tokenStream = new LowerCaseFilter(Version.LUCENE_4_9, tokenStream);
        tokenStream = new MorphAnalTokenFilter(tokenStream, morphAnalyzer, analyzeBest);
        return new TokenStreamComponents(tokenizer, tokenStream);
    }
}
[/code]
Note that in the code above the TwitterFlexLuceneTokenizer is not
thread-safe and extends o.a.lucene.analysis.Tokenizer. jvisualvm shows 97
instances of this class.
Let me know if I should post the code of the other classes further up the
class sequence.
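
For comparison, here is a minimal sketch of another way to share one
expensive, immutable object between analyzer instances. It is not what the
plugin does today. SharedResourceCache and its loader function are
hypothetical names (not part of Lucene or Elasticsearch), and it assumes
Java 8 or later for computeIfAbsent. Unlike a single static field, it keeps
one instance per configuration path.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical helper: caches one expensive, immutable resource per configuration path. */
final class SharedResourceCache<T> {

    private final ConcurrentHashMap<String, T> cache = new ConcurrentHashMap<>();
    private final Function<String, T> loader;
    // Counts how many times the loader actually ran (useful when debugging instance counts).
    private final AtomicInteger loadCount = new AtomicInteger();

    SharedResourceCache(Function<String, T> loader) {
        this.loader = loader;
    }

    /** Returns the resource for confFile, loading it at most once per distinct path. */
    T get(String confFile) {
        // ConcurrentHashMap.computeIfAbsent applies the function atomically per key,
        // so concurrent callers asking for the same path get the same instance.
        return cache.computeIfAbsent(confFile, path -> {
            loadCount.incrementAndGet();
            return loader.apply(path);
        });
    }

    int loadCount() {
        return loadCount.get();
    }
}
```

A static field holding such a cache is still only shared within one class
loader, so if the plugin classes were loaded by several class loaders, each
would get its own copy.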
Dmitry
On Wednesday, 18 March 2015 18:22:47 UTC+2, Jörg Prante wrote:
>
> Is it possible to examine the code of your plugin?
>
> Generally speaking, analyzers are instantiated per index creation for each
> thread.
>
> In org.elasticsearch.index.analysis.AnalysisModule, you can see how
> analyzer providers and factories are prepared for injection with the help of
> the ES injection module, which is based on Guice. Basically, the factories
> are kept as singletons, and each thread can pick analyzer instances from
> the factory when needed. All in all, Lucene analyzer classes are not
> thread-safe, in particular the tokenizers. This means it is up to the
> implementor of an analyzer/tokenizer to store immutable objects as
> singletons in a correct way so that all threads can safely access them.
>
> Jörg
>
> On Wed, Mar 18, 2015 at 4:02 PM, Dmitry Kan <[email protected]> wrote:
>
>> Hi,
>>
>> Could somebody answer, please?
>>
>>
>> On Tuesday, 17 March 2015 19:05:38 UTC+2, Dmitry Kan wrote:
>>>
>>> Hello!
>>>
>>> I'm a newbie in elasticsearch, so forgive me if the question is lame.
>>>
>>> I have implemented a custom plugin using a custom lemmatizer and a
>>> tokenizer. The simplified class sequence:
>>>
>>>
>>> AnalysisMorphologyPlugin->MorphologyAnalysisBinderProcessor->SemanticAnalyzerTwitterLemmatizerProvider->RussianLemmatizingTwitterAnalyzer
>>>
>>> In the RussianLemmatizingTwitterAnalyzer's constructor I load the custom
>>> object for lemmatization (an object unrelated to Lucene/ES) in a singleton
>>> fashion (in a synchronized code block).
>>> Then, when creating 14 indices in the same JVM I see
>>> 14 instances of RussianLemmatizingTwitterAnalyzer,
>>> 4 instances of SemanticAnalyzerTwitterLemmatizerProvider,
>>> 4 instances of MorphologyAnalysisBinderProcessor,
>>> 30 instances of the custom lemmatizer (in each
>>> RussianLemmatizingTwitterAnalyzer only one instance is expected, so should
>>> be 14),
>>> 1 instance of AnalysisMorphologyPlugin.
>>>
>>> The question is: can the RussianLemmatizingTwitterAnalyzer object be
>>> shared between indices? Or is it by design that one must be loaded
>>> separately per index?
>>> What could be wrong in the code that makes 30 instances of the custom
>>> singleton lemmatizer instead of 14?
>>>
>>> Currently, *with* the plugin the JVM reserves 100M of RAM with no data.
>>> *Without* the plugin the JVM reserves 2M with no data.
>>> Elasticsearch 1.3.2, Lucene 4.9.0.
>>>
>>> Regards,
>>>
>>> Dmitry Kan
>>>
>>> --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected] <javascript:>.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/c2c57184-ee3b-4600-9091-a515b496b867%40googlegroups.com
>>
>> <https://groups.google.com/d/msgid/elasticsearch/c2c57184-ee3b-4600-9091-a515b496b867%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>