Yes, I use an analyzer provider. Here is the code:
[code]
public class SemanticAnalyzerTwitterLemmatizerProvider
        extends AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {

    private final RussianLemmatizingTwitterAnalyzer russianLemmatizingGenericAnalyzer;
    private final Logger Log =
            Logger.getLogger(SemanticAnalyzerTwitterLemmatizerProvider.class.getName());

    @Inject
    public SemanticAnalyzerTwitterLemmatizerProvider(Index index,
                                                     @IndexSettings Settings indexSettings,
                                                     @Assisted String name,
                                                     Settings settings) {
        super(index, indexSettings, name, settings);
        Log.info("called super with name=" + name);
        try {
            String lemmatizerConfFile = settings.get("lemmatizerConf");
            boolean analyzeBest = Boolean.parseBoolean(settings.get("analyzeBest"));
            russianLemmatizingGenericAnalyzer =
                    new RussianLemmatizingTwitterAnalyzer(lemmatizerConfFile, analyzeBest);
        } catch (IOException ioe) {
            throw new ElasticsearchIllegalArgumentException(
                    "Unable to load Russian morphology analyzer", ioe);
        } catch (Exception e) {
            throw new ElasticsearchIllegalArgumentException(
                    "Unable to load Russian morphology analyzer", e);
        }
    }

    @Override
    public RussianLemmatizingTwitterAnalyzer get() {
        return russianLemmatizingGenericAnalyzer;
    }
}
[/code]
Would you recommend using your approach instead of this one? Do you see any
issues in my implementation of the provider?
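
To check that I understand the pattern, here is a minimal plain-Java sketch of it: the provider loads the expensive lemmatizer once, and get() hands the same instance to every analyzer it creates. It deliberately uses no Elasticsearch or Lucene types; FakeMorphAnalyzer, FakeAnalyzer and FakeAnalyzerProvider are hypothetical stand-ins I made up for illustration, and the load counter exists only to make the sharing visible.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Plain-Java sketch of a provider that loads a heavy resource once and shares it. */
public class SharedLemmatizerSketch {

    /** Hypothetical stand-in for the expensive lemmatizer; counts constructions. */
    static final class FakeMorphAnalyzer {
        static final AtomicInteger LOADS = new AtomicInteger();
        final String confFile;

        FakeMorphAnalyzer(String confFile) {
            this.confFile = confFile;
            LOADS.incrementAndGet(); // one increment per (simulated) dictionary load
        }
    }

    /** Hypothetical stand-in for the analyzer: cheap to create, only holds references. */
    static final class FakeAnalyzer {
        final FakeMorphAnalyzer morphAnalyzer;
        final boolean analyzeBest;

        FakeAnalyzer(FakeMorphAnalyzer morphAnalyzer, boolean analyzeBest) {
            this.morphAnalyzer = morphAnalyzer;
            this.analyzeBest = analyzeBest;
        }
    }

    /** Hypothetical stand-in for the analyzer provider. */
    static final class FakeAnalyzerProvider {
        private final FakeMorphAnalyzer morphAnalyzer;
        private final boolean analyzeBest;

        FakeAnalyzerProvider(String confFile, boolean analyzeBest) {
            // The heavy object is created here, once per provider instance.
            this.morphAnalyzer = new FakeMorphAnalyzer(confFile);
            this.analyzeBest = analyzeBest;
        }

        /** Creates a new lightweight analyzer that reuses the shared lemmatizer. */
        FakeAnalyzer get() {
            return new FakeAnalyzer(morphAnalyzer, analyzeBest);
        }
    }

    public static void main(String[] args) {
        FakeAnalyzerProvider provider = new FakeAnalyzerProvider("lemmatizer.properties", true);
        FakeAnalyzer first = provider.get();
        FakeAnalyzer second = provider.get();
        System.out.println("distinct analyzers: " + (first != second));
        System.out.println("shared lemmatizer: " + (first.morphAnalyzer == second.morphAnalyzer));
        System.out.println("lemmatizer loads: " + FakeMorphAnalyzer.LOADS.get());
    }
}
```

If this is right, the number of lemmatizer instances should follow the number of provider instances rather than the number of analyzers, assuming the lemmatizer itself is safe to share between threads.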
On Wednesday, 18 March 2015 20:52:08 UTC+2, Jörg Prante wrote:
>
> Do you use an analyzer provider?
>
> Example
>
> public class RussianLemmatizingTwitterAnalyzerProvider
>         extends AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {
>
>     private final MorphAnalyzer morphAnalyzer;
>
>     ...
>
>     @Inject
>     public RussianLemmatizingTwitterAnalyzerProvider(Index index,
>                                                      @IndexSettings Settings indexSettings,
>                                                      Environment environment,
>                                                      @Assisted String name,
>                                                      @Assisted Settings settings) {
>         super(index, indexSettings, name, settings);
>         this.morphAnalyzer = createMorphAnalyzer(environment, settings, ...);
>     }
>
>     @Override
>     public RussianLemmatizingTwitterAnalyzer get() {
>         return new RussianLemmatizingTwitterAnalyzer(morphAnalyzer, ...);
>     }
>
>     private MorphAnalyzer createMorphAnalyzer(...) {
>     }
>
> }
>
>
> Only such a provider is bound as a singleton. So the analyzer provider can
> set up the analyzer configuration exactly once (with a MorphAnalyzer
> instance etc.), and with the get() method, it creates analyzers as required.
>
> Jörg
>
> On Wed, Mar 18, 2015 at 6:40 PM, Dmitry Kan <[email protected]> wrote:
>
>>
>> Jörg,
>>
>> Thanks for replying!
>>
>> Here is the code of the RussianLemmatizingTwitterAnalyzer, the deepest
>> class in the simplified class sequence I have posted in the original
>> message.
>>
>> [code]
>>
>> public class RussianLemmatizingTwitterAnalyzer extends Analyzer {
>>
>>     private static MorphAnalyzer morphAnalyzerGlobal;
>>
>>     boolean useSyncMethod = true;
>>
>>     private static final boolean verbose = false;
>>     private MorphAnalyzer morphAnalyzer;
>>     private boolean analyzeBest = false;
>>
>>     private static final Logger Log =
>>             Logger.getLogger(RussianLemmatizingTwitterAnalyzer.class.getName());
>>
>>     public RussianLemmatizingTwitterAnalyzer(String lemmatizerConfFile, boolean analyzeBest)
>>             throws IOException {
>>         this.analyzeBest = analyzeBest;
>>
>>         if (useSyncMethod) {
>>             this.morphAnalyzer = loadCustomAnalyzer(lemmatizerConfFile);
>>         } else {
>>             Properties properties = new Properties();
>>
>>             Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>>
>>             properties.load(new StringReader(
>>                     IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
>>             this.morphAnalyzer = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
>>         }
>>     }
>>
>>     private static MorphAnalyzer loadAnalyzer(String lemmatizerConfFile)
>>             throws IOException {
>>         Properties properties = new Properties();
>>
>>         Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>>
>>         properties.load(new StringReader(
>>                 IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
>>         MorphAnalyzer morphAnalyzer1 = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
>>
>>         if (verbose) {
>>             if (morphAnalyzer1 != null) {
>>                 Log.info("Successfully created the analyzer!");
>>                 Log.info(morphAnalyzer1.analyzeBest("билета").toString());
>>             } else {
>>                 Log.severe("Failed to create the morphAnalyzer object");
>>             }
>>         }
>>
>>         return morphAnalyzer1;
>>     }
>>
>>     public static synchronized MorphAnalyzer loadCustomAnalyzer(String lemmatizerConfFile)
>>             throws IOException {
>>         if (morphAnalyzerGlobal == null) {
>>             morphAnalyzerGlobal = loadAnalyzer(lemmatizerConfFile);
>>         }
>>
>>         return morphAnalyzerGlobal;
>>     }
>>
>>     @Override
>>     protected TokenStreamComponents createComponents(String fieldName, final Reader reader) {
>>         Tokenizer tokenizer = new TwitterFlexLuceneTokenizer(reader);
>>
>>         Log.config("Using Tokenizer: " + tokenizer.getClass().getSimpleName());
>>
>>         TokenStream tokenStream = tokenizer;
>>         tokenStream = new LowerCaseFilter(Version.LUCENE_4_9, tokenStream);
>>         tokenStream = new MorphAnalTokenFilter(tokenStream, morphAnalyzer, analyzeBest);
>>         return new TokenStreamComponents(tokenizer, tokenStream);
>>     }
>>
>> }
>>
>>
>> [/code]
>>
>> Note that in the code above, TwitterFlexLuceneTokenizer is not thread-safe
>> and extends o.a.lucene.analysis.Tokenizer. jvisualvm shows 97 instances of
>> this class.
>>
>> Let me know if I should post code from further up the class chain.
>>
>> Dmitry
>>
>> On Wednesday, 18 March 2015 18:22:47 UTC+2, Jörg Prante wrote:
>>>
>>> Is it possible to examine the code of your plugin?
>>>
>>> Generally speaking, analyzers are instantiated per index creation for
>>> each thread.
>>>
>>> In org.elasticsearch.index.analysis.AnalysisModule, you can see how
>>> analyzer providers and factories are prepared for injection with the help of
>>> the ES injection module, which is based on Guice. Basically, the factories
>>> are kept as singletons, and each thread can pick analyzer instances from
>>> the factory when needed. In general, Lucene analyzer classes are not
>>> thread-safe, in particular the tokenizers. This means it is up to the
>>> implementor of an analyzer/tokenizer to store immutable objects as
>>> singletons in a correct way so that all threads can safely access them.
>>>
>>> Jörg
>>>
>>> On Wed, Mar 18, 2015 at 4:02 PM, Dmitry Kan <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Could somebody answer, please?
>>>>
>>>>
>>>> On Tuesday, 17 March 2015 19:05:38 UTC+2, Dmitry Kan wrote:
>>>>>
>>>>> Hello!
>>>>>
>>>>> I'm a newbie in Elasticsearch, so forgive me if the question is lame.
>>>>>
>>>>> I have implemented a custom plugin using a custom lemmatizer and a
>>>>> tokenizer. The simplified class sequence:
>>>>>
>>>>>
>>>>> AnalysisMorphologyPlugin->MorphologyAnalysisBinderProcessor->SemanticAnalyzerTwitterLemmatizerProvider->RussianLemmatizingTwitterAnalyzer
>>>>>
>>>>> In the RussianLemmatizingTwitterAnalyzer's ctor I load the custom object
>>>>> for lemmatization (an object unrelated to Lucene/ES) in a singleton fashion
>>>>> (in a synchronized code block).
>>>>> Then, when creating 14 indices in the same JVM I see
>>>>> 14 instances of RussianLemmatizingTwitterAnalyzer,
>>>>> 4 instances of SemanticAnalyzerTwitterLemmatizerProvider,
>>>>> 4 instances of MorphologyAnalysisBinderProcessor,
>>>>> 30 instances of the custom lemmatizer (in each
>>>>> RussianLemmatizingTwitterAnalyzer only one instance is expected, so
>>>>> should be 14),
>>>>> 1 instance of AnalysisMorphologyPlugin.
>>>>>
>>>>> The question is: can the RussianLemmatizingTwitterAnalyzer object be
>>>>> shared between indices, or is it by design that it must be loaded
>>>>> separately per index?
>>>>> And what in the code could be creating 30 instances of the custom
>>>>> singleton lemmatizer instead of 14?
>>>>>
>>>>> Currently, *with* the plugin the JVM reserves 100M of RAM with no data;
>>>>> *without* the plugin it reserves 2M with no
>>>>> data. Elasticsearch 1.3.2, Lucene 4.9.0.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Dmitry Kan
>>>>>
>>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "elasticsearch" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/elasticsearch/c2c57184-ee3b-4600-9091-a515b496b867%40googlegroups.com.
>>>>
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>
>