Jörg,
Following your suggestion I refactored the code like so:
[code]
public class SemanticAnalyzerTwitterLemmatizerProvider extends
        AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {

    //private final RussianLemmatizingTwitterAnalyzer russianLemmatizingGenericAnalyzer;
    private final MorphAnalyzer morphAnalyzer;
    private String lemmatizerConfFile;
    boolean analyzeBest;

    private final Logger Log =
            Logger.getLogger(SemanticAnalyzerTwitterLemmatizerProvider.class.getName());

    @Inject
    public SemanticAnalyzerTwitterLemmatizerProvider(Index index,
            @IndexSettings Settings indexSettings,
            @Assisted String name,
            Settings settings) {
        super(index, indexSettings, name, settings);
        Log.info("called super with name=" + name);
        try {
            /*
            String lemmatizerConfFile = settings.get("lemmatizerConf");
            boolean analyzeBest = Boolean.parseBoolean(settings.get("analyzeBest"));
            russianLemmatizingGenericAnalyzer =
                    new RussianLemmatizingTwitterAnalyzer(lemmatizerConfFile, analyzeBest);
            */
            lemmatizerConfFile = settings.get("lemmatizerConf");
            morphAnalyzer = createMorphAnalyzer();
        } catch (IOException ioe) {
            throw new ElasticsearchIllegalArgumentException(
                    "Unable to load Russian morphology analyzer", ioe);
        } catch (Exception e) {
            throw new ElasticsearchIllegalArgumentException(
                    "Unable to load Russian morphology analyzer", e);
        }
    }

    // Loads the lemmatizer once per provider instance.
    private MorphAnalyzer createMorphAnalyzer() throws IOException {
        Log.info("start of createMorphAnalyzer()");
        MorphAnalyzer morphAnalyzer1;
        Properties properties = new Properties();
        Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
        properties.load(new StringReader(
                IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
        morphAnalyzer1 = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
        Log.info("end of createMorphAnalyzer()");
        return morphAnalyzer1;
    }

    // Returns a new analyzer on every call; all of them share one MorphAnalyzer.
    @Override
    public RussianLemmatizingTwitterAnalyzer get() {
        return new RussianLemmatizingTwitterAnalyzer(morphAnalyzer);
    }
}
[/code]
In the logs I still see the MorphAnalyzer object being created more than
once. Is something still missing in the logic?
Log excerpt:
[2015-03-18 22:34:06,900][INFO ][cluster.metadata ] [Soldier X] [rustest] deleting index
Mar 18, 2015 10:34:06 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider <init>
INFO: called super with name=russian_morphology_twitter
Mar 18, 2015 10:34:06 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: start of createMorphAnalyzer()
Mar 18, 2015 10:34:06 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: Loading lemmatizer properties from /Users/dmitry/projects/information_retrieval/elasticsearch-analysis-morphology-youscan/lemmatizer/lemmatizer-ru.properties
Mar 18, 2015 10:34:07 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: end of createMorphAnalyzer()
[2015-03-18 22:34:07,711][INFO ][cluster.metadata ] [Soldier X] [rustest] creating index, cause [api], shards [5]/[1], mappings []
Mar 18, 2015 10:34:07 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider <init>
INFO: called super with name=russian_morphology_twitter
Mar 18, 2015 10:34:07 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: start of createMorphAnalyzer()
Mar 18, 2015 10:34:07 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: Loading lemmatizer properties from /Users/dmitry/projects/information_retrieval/elasticsearch-analysis-morphology-youscan/lemmatizer/lemmatizer-ru.properties
Mar 18, 2015 10:34:08 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: end of createMorphAnalyzer()
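
The log above shows the provider constructor, and with it createMorphAnalyzer(), running twice within about a second, so each provider instance loads its own copy. Purely as an illustration, here is a minimal sketch of one way the loaded object could be shared between provider instances: a static cache keyed by the configuration file path. ExpensiveResource and SharedResourceCache are hypothetical stand-ins, not part of Elasticsearch or the lemmatizer library. The sketch assumes Java 8 (for computeIfAbsent) and that all provider instances are loaded by the same class loader; if they are not, each class loader gets its own static cache and this would not help.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class SharedResourceCache {

    /** Hypothetical stand-in for the expensive-to-load MorphAnalyzer. */
    static final class ExpensiveResource {
        // Counts how many times a resource has actually been constructed.
        static final AtomicInteger LOADS = new AtomicInteger();
        final String confFile;

        ExpensiveResource(String confFile) {
            LOADS.incrementAndGet();
            this.confFile = confFile;
        }
    }

    // One entry per configuration file, shared by every caller in this class loader.
    private static final ConcurrentHashMap<String, ExpensiveResource> CACHE =
            new ConcurrentHashMap<>();

    static ExpensiveResource forConfig(String confFile) {
        // ConcurrentHashMap.computeIfAbsent applies the loader at most once per key.
        return CACHE.computeIfAbsent(confFile, ExpensiveResource::new);
    }

    public static void main(String[] args) {
        ExpensiveResource first = forConfig("lemmatizer-ru.properties");
        ExpensiveResource second = forConfig("lemmatizer-ru.properties");
        System.out.println("same instance: " + (first == second));
        System.out.println("loads: " + ExpensiveResource.LOADS.get());
    }
}
```

In the provider, the constructor would then ask such a cache for the object instead of loading it directly. Whether that removes the duplicate loads in a real node depends on the class-loader assumption stated above.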
On Wednesday, 18 March 2015 21:27:12 UTC+2, Jörg Prante wrote:
>
> In the get() method of the provider, it would be better to always return a
> new analyzer instance.
>
> The configuration and setup of the analyzer could be refactored to the
> provider.
>
> Jörg
>
> On Wed, Mar 18, 2015 at 8:12 PM, Dmitry Kan <[email protected]> wrote:
>
>> Yes, I use an analyzer provider. Here is the code:
>>
>> [code]
>>
>> public class SemanticAnalyzerTwitterLemmatizerProvider extends
>>         AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {
>>
>>     private final RussianLemmatizingTwitterAnalyzer russianLemmatizingGenericAnalyzer;
>>
>>     private final Logger Log =
>>             Logger.getLogger(SemanticAnalyzerTwitterLemmatizerProvider.class.getName());
>>
>>     @Inject
>>     public SemanticAnalyzerTwitterLemmatizerProvider(Index index,
>>             @IndexSettings Settings indexSettings,
>>             @Assisted String name,
>>             Settings settings) {
>>         super(index, indexSettings, name, settings);
>>         Log.info("called super with name=" + name);
>>         try {
>>             String lemmatizerConfFile = settings.get("lemmatizerConf");
>>             boolean analyzeBest = Boolean.parseBoolean(settings.get("analyzeBest"));
>>             russianLemmatizingGenericAnalyzer =
>>                     new RussianLemmatizingTwitterAnalyzer(lemmatizerConfFile, analyzeBest);
>>         } catch (IOException ioe) {
>>             throw new ElasticsearchIllegalArgumentException(
>>                     "Unable to load Russian morphology analyzer", ioe);
>>         } catch (Exception e) {
>>             throw new ElasticsearchIllegalArgumentException(
>>                     "Unable to load Russian morphology analyzer", e);
>>         }
>>     }
>>
>>     @Override
>>     public RussianLemmatizingTwitterAnalyzer get() {
>>         return russianLemmatizingGenericAnalyzer;
>>     }
>> }
>>
>>
>> [/code]
>>
>> Would you recommend using your approach instead of this one? Do you spot
>> any issues in my implementation of the provider?
>>
>> On Wednesday, 18 March 2015 20:52:08 UTC+2, Jörg Prante wrote:
>>>
>>> Do you use an analyzer provider?
>>>
>>> Example
>>>
>>> public class RussianLemmatizingTwitterAnalyzerProvider extends
>>>         AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {
>>>
>>>     private final MorphAnalyzer morphAnalyzer;
>>>
>>>     ...
>>>
>>>     @Inject
>>>     public RussianLemmatizingTwitterAnalyzerProvider(Index index,
>>>             @IndexSettings Settings indexSettings,
>>>             Environment environment,
>>>             @Assisted String name,
>>>             @Assisted Settings settings) {
>>>         super(index, indexSettings, name, settings);
>>>         this.morphAnalyzer = createMorphAnalyzer(environment, settings, ...);
>>>     }
>>>
>>>     @Override
>>>     public RussianLemmatizingTwitterAnalyzer get() {
>>>         return new RussianLemmatizingTwitterAnalyzer(morphAnalyzer, ...);
>>>     }
>>>
>>>     private MorphAnalyzer createMorphAnalyzer(...) {
>>>     }
>>>
>>> }
>>>
>>>
>>> Only such a provider is bound to a singleton. So the analyzer provider
>>> can set up the analyzer configuration exactly once (with a MorphAnalyzer
>>> instance etc.), and its get() method creates analyzers as required.
>>>
>>> Jörg
>>>
>>> On Wed, Mar 18, 2015 at 6:40 PM, Dmitry Kan <[email protected]> wrote:
>>>
>>>>
>>>> Jörg,
>>>>
>>>> Thanks for replying!
>>>>
>>>> Here is the code of the RussianLemmatizingTwitterAnalyzer, the deepest
>>>> class in the simplified class sequence I have posted in the original
>>>> message.
>>>>
>>>> [code]
>>>>
>>>> public class RussianLemmatizingTwitterAnalyzer extends Analyzer {
>>>>
>>>>     private static MorphAnalyzer morphAnalyzerGlobal;
>>>>
>>>>     boolean useSyncMethod = true;
>>>>
>>>>     private static final boolean verbose = false;
>>>>     private MorphAnalyzer morphAnalyzer;
>>>>     private boolean analyzeBest = false;
>>>>
>>>>     private static final Logger Log =
>>>>             Logger.getLogger(RussianLemmatizingTwitterAnalyzer.class.getName());
>>>>
>>>>     public RussianLemmatizingTwitterAnalyzer(String lemmatizerConfFile,
>>>>             boolean analyzeBest) throws IOException {
>>>>         this.analyzeBest = analyzeBest;
>>>>
>>>>         if (useSyncMethod) {
>>>>             this.morphAnalyzer = loadCustomAnalyzer(lemmatizerConfFile);
>>>>         } else {
>>>>             Properties properties = new Properties();
>>>>
>>>>             Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>>>>
>>>>             properties.load(new StringReader(
>>>>                     IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
>>>>             this.morphAnalyzer =
>>>>                     MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
>>>>         }
>>>>     }
>>>>
>>>>     private static MorphAnalyzer loadAnalyzer(String lemmatizerConfFile)
>>>>             throws IOException {
>>>>         Properties properties = new Properties();
>>>>
>>>>         Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>>>>
>>>>         properties.load(new StringReader(
>>>>                 IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
>>>>         MorphAnalyzer morphAnalyzer1 =
>>>>                 MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
>>>>
>>>>         if (verbose) {
>>>>             if (morphAnalyzer1 != null) {
>>>>                 Log.info("Successfully created the analyzer!");
>>>>                 Log.info(morphAnalyzer1.analyzeBest("билета").toString());
>>>>             } else {
>>>>                 Log.severe("Failed to create the morphAnalyzer object");
>>>>             }
>>>>         }
>>>>
>>>>         return morphAnalyzer1;
>>>>     }
>>>>
>>>>     public static synchronized MorphAnalyzer loadCustomAnalyzer(
>>>>             String lemmatizerConfFile) throws IOException {
>>>>         if (morphAnalyzerGlobal == null) {
>>>>             morphAnalyzerGlobal = loadAnalyzer(lemmatizerConfFile);
>>>>         }
>>>>
>>>>         return morphAnalyzerGlobal;
>>>>     }
>>>>
>>>>     @Override
>>>>     protected TokenStreamComponents createComponents(String fieldName,
>>>>             final Reader reader) {
>>>>         Tokenizer tokenizer = new TwitterFlexLuceneTokenizer(reader);
>>>>
>>>>         Log.config("Using Tokenizer: " + tokenizer.getClass().getSimpleName());
>>>>
>>>>         TokenStream tokenStream = tokenizer;
>>>>         tokenStream = new LowerCaseFilter(Version.LUCENE_4_9, tokenStream);
>>>>         tokenStream = new MorphAnalTokenFilter(tokenStream, morphAnalyzer, analyzeBest);
>>>>         return new TokenStreamComponents(tokenizer, tokenStream);
>>>>     }
>>>>
>>>> }
>>>>
>>>>
>>>> [/code]
>>>>
>>>> Note that in the code above the TwitterFlexLuceneTokenizer is not
>>>> thread-safe and extends o.a.lucene.analysis.Tokenizer. In jvisualvm there
>>>> are 97 instances of this class.
>>>>
>>>> Let me know if I should post other code snippets from further up the
>>>> class sequence.
>>>>
>>>> Dmitry
>>>>
>>>> On Wednesday, 18 March 2015 18:22:47 UTC+2, Jörg Prante wrote:
>>>>>
>>>>> Is it possible to examine the code of your plugin?
>>>>>
>>>>> Generally speaking, analyzers are instantiated per index creation for
>>>>> each thread.
>>>>>
>>>>> In org.elasticsearch.index.analysis.AnalysisModule, you can see how
>>>>> analyzer providers and factories are prepared for injection with the
>>>>> help of the ES injection module, which is based on Guice. Basically, the
>>>>> factories are kept as singletons, and each thread can pick analyzer
>>>>> instances from the factory when needed. In general, Lucene analyzer
>>>>> classes are not thread-safe, in particular the tokenizers. This means it
>>>>> is up to the implementor of an analyzer/tokenizer to store immutable
>>>>> objects as singletons in a correct way so that all threads can safely
>>>>> access them.
>>>>>
>>>>> Jörg
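
Jörg's point about immutable shared objects can be illustrated with a small sketch: the shared object is immutable and created once, while each thread gets its own mutable analyzer wrapped around it. Lemmas and StatefulAnalyzer are hypothetical stand-ins, not Lucene or Elasticsearch classes, and the ThreadLocal-based wiring (Java 8's ThreadLocal.withInitial) is only one assumed way to hand out per-thread instances.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class SharedImmutableDemo {

    /** Hypothetical immutable dictionary: safe to share between threads. */
    static final class Lemmas {
        private final Map<String, String> table;

        Lemmas(Map<String, String> source) {
            // Defensive copy wrapped as unmodifiable; never changes after construction.
            this.table = Collections.unmodifiableMap(new HashMap<>(source));
        }

        String lemmaOf(String word) {
            String lemma = table.get(word);
            return lemma != null ? lemma : word; // unknown words pass through unchanged
        }
    }

    /** Hypothetical analyzer with mutable state: one instance per thread. */
    static final class StatefulAnalyzer {
        private final Lemmas lemmas; // shared, read-only
        private int tokensSeen;      // mutable, must not be shared across threads

        StatefulAnalyzer(Lemmas lemmas) {
            this.lemmas = lemmas;
        }

        String analyze(String word) {
            tokensSeen++;
            return lemmas.lemmaOf(word);
        }

        int tokensSeen() {
            return tokensSeen;
        }
    }

    // Created once; every analyzer reads from this same instance.
    static final Lemmas SHARED = new Lemmas(Collections.singletonMap("tickets", "ticket"));

    // Each thread lazily gets its own analyzer wrapping the same Lemmas instance.
    static final ThreadLocal<StatefulAnalyzer> PER_THREAD =
            ThreadLocal.withInitial(() -> new StatefulAnalyzer(SHARED));

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> System.out.println(
                Thread.currentThread().getName() + ": " + PER_THREAD.get().analyze("tickets"));
        Thread t1 = new Thread(work, "t1");
        Thread t2 = new Thread(work, "t2");
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
```

The design choice here is that only the read-only dictionary is shared; anything that changes while text is processed stays inside an object owned by a single thread.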
>>>>>
>>>>> On Wed, Mar 18, 2015 at 4:02 PM, Dmitry Kan <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Could somebody answer, please?
>>>>>>
>>>>>>
>>>>>> On Tuesday, 17 March 2015 19:05:38 UTC+2, Dmitry Kan wrote:
>>>>>>>
>>>>>>> Hello!
>>>>>>>
>>>>>>> I'm a newbie in elasticsearch, so forgive me if the question is lame.
>>>>>>>
>>>>>>> I have implemented a custom plugin using a custom lemmatizer and a
>>>>>>> tokenizer. The simplified class sequence:
>>>>>>>
>>>>>>>
>>>>>>> AnalysisMorphologyPlugin->MorphologyAnalysisBinderProcessor->SemanticAnalyzerTwitterLemmatizerProvider->RussianLemmatizingTwitterAnalyzer
>>>>>>>
>>>>>>> In the RussianLemmatizingTwitterAnalyzer's ctor I load the custom
>>>>>>> object for lemmatization (an object unrelated to lucene/es) in a
>>>>>>> singleton fashion (in a synchronized code block).
>>>>>>> Then, when creating 14 indices in the same JVM I see
>>>>>>> 14 instances of RussianLemmatizingTwitterAnalyzer,
>>>>>>> 4 instances of SemanticAnalyzerTwitterLemmatizerProvider,
>>>>>>> 4 instances of MorphologyAnalysisBinderProcessor,
>>>>>>> 30 instances of the custom lemmatizer (in each
>>>>>>> RussianLemmatizingTwitterAnalyzer only one instance is expected, so
>>>>>>> should be 14),
>>>>>>> 1 instance of AnalysisMorphologyPlugin.
>>>>>>>
>>>>>>> The question is: can a RussianLemmatizingTwitterAnalyzer object be
>>>>>>> shared between indices? Or is it by design that they must be loaded
>>>>>>> separately per index?
>>>>>>> What could be wrong in the code that produces 30 instances of the
>>>>>>> custom singleton lemmatizer instead of 14?
>>>>>>>
>>>>>>> The current state is that *with* the plugin 100M of RAM is reserved
>>>>>>> by the JVM with no data. *Without* the plugin the JVM reserves 2M with
>>>>>>> no data. Elasticsearch 1.3.2, Lucene 4.9.0.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Dmitry Kan
>>>>>>>
>>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "elasticsearch" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/elasticsearch/c2c57184-ee3b-4600-9091-a515b496b867%40googlegroups.com.
>>>>>>
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>
>>>>
>>>
>>
>
>