[
https://issues.apache.org/jira/browse/SOLR-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982061#action_12982061
]
Dawid Weiss commented on SOLR-2282:
-----------------------------------
I think I nailed it. I did a whitebox inspection of the Carrot2 code and thought
it impossible for a concurrency bug to creep in (particularly with a simple
controller). What we didn't take into account is that the Carrot2
infrastructure itself allows a single object instance to be bound to multiple
components at runtime, which effectively shares that instance across threads.
The code that does this happens to be in Solr's code base, not in Carrot2. The
bug is caused by the following series of events:
1) The controller in Solr itself is initialized with a single instance of "new
LuceneLanguageModelFactory()" -- this factory is then injected into all
components at runtime.
2) The base class of LuceneLanguageModelFactory is DefaultLanguageModelFactory
which has an object-local cache of stemmers and tokenizers. In Carrot2 3.4.2,
factories are component-bound anyway, so a factory can reuse its resources. In
the trunk version, this is no longer the case (factories simply create new
objects as they are requested).
3) Because of the tokenizers/stemmers cache, the same tokenizers and stemmers
can be used in parallel when two requests are made at the same time. I think
this should be reproducible on all computers, regardless of the number of cores
or their speed; it's just a matter of time. Clustering takes relatively long
compared to tokenization, so two tokenizations overlapping (and corrupting
internal data structures) is a rare event (and yet, as we could see, frequent
enough to manifest itself during tests).
{noformat}
// Customize the language model factory. The implementation we provide here
// is included in the code base of Solr, so that it's possible to refactor
// the Lucene APIs the factory relies on if needed.
initAttributes.put("PreprocessingPipeline.languageModelFactory",
    new LuceneLanguageModelFactory());
this.controller.init(initAttributes);
{noformat}
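To make the sharing problem concrete, here is a minimal standalone sketch. It
does not use the Carrot2 or Solr APIs; CachingFactory, Tokenizer and Component
are hypothetical stand-ins for a factory with an object-local cache, a stateful
tokenizer and a component that receives the factory by injection. It only shows
that injecting one factory instance into two components hands both of them the
very same cached tokenizer object.

```java
import java.util.HashMap;
import java.util.Map;

public class SharedFactorySketch {
    /** Hypothetical stand-in for a stateful tokenizer that is not thread-safe. */
    static class Tokenizer {
        // Mutable per-object state; concurrent use by two threads would corrupt it.
        final StringBuilder buffer = new StringBuilder();
    }

    /** Hypothetical stand-in for a factory with an object-local cache. */
    static class CachingFactory {
        private final Map<String, Tokenizer> cache = new HashMap<>();

        Tokenizer getTokenizer(String language) {
            // The cached tokenizer is returned to every caller of this factory.
            return cache.computeIfAbsent(language, key -> new Tokenizer());
        }
    }

    /** Hypothetical component that gets its factory injected. */
    static class Component {
        private final CachingFactory factory;

        Component(CachingFactory factory) {
            this.factory = factory;
        }

        Tokenizer tokenizerFor(String language) {
            return factory.getTokenizer(language);
        }
    }

    /** True if both components end up with the same tokenizer object. */
    static boolean sharesTokenizer(Component a, Component b) {
        return a.tokenizerFor("en") == b.tokenizerFor("en");
    }

    public static void main(String[] args) {
        // One factory instance injected into two components: state is shared.
        CachingFactory single = new CachingFactory();
        System.out.println(sharesTokenizer(new Component(single), new Component(single))); // true

        // One factory per component: each component has its own tokenizer.
        System.out.println(sharesTokenizer(
            new Component(new CachingFactory()), new Component(new CachingFactory()))); // false
    }
}
```

If the two components run on different threads, the first case is exactly the
situation in which two tokenizations can overlap on one object.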
The fix for the problem would be to:
1) upgrade to trunk/future Carrot2 version (because of different memory
management in factories),
2) pass a class instead of an instance in the initialization parameters, so
that no single factory instance is shared. This should do it:
{noformat}
// Customize the language model factory. The implementation we provide here
// is included in the code base of Solr, so that it's possible to refactor
// the Lucene APIs the factory relies on if needed.
initAttributes.put("PreprocessingPipeline.languageModelFactory",
    LuceneLanguageModelFactory.class);
this.controller.init(initAttributes);
{noformat}
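A rough sketch of why passing a class rather than an instance helps. It assumes
that an attribute given as a class is instantiated separately for each
component, while an attribute given as an instance is bound as-is. The
resolve() helper and LanguageModelFactory class below are hypothetical and only
illustrate that assumption; they are not Carrot2's actual binding code.

```java
public class ClassVersusInstanceSketch {
    /** Hypothetical stand-in for a language model factory. */
    static class LanguageModelFactory {
    }

    /**
     * Hypothetical attribute resolution: a Class value yields a fresh object on
     * every call, any other value is returned unchanged.
     */
    static Object resolve(Object attributeValue) {
        if (attributeValue instanceof Class<?>) {
            try {
                return ((Class<?>) attributeValue).getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Could not instantiate " + attributeValue, e);
            }
        }
        return attributeValue;
    }

    /** True if two components resolving the same attribute get the same object. */
    static boolean sharedAcrossComponents(Object attributeValue) {
        return resolve(attributeValue) == resolve(attributeValue);
    }

    public static void main(String[] args) {
        // Passing an instance: both components see the same factory object.
        System.out.println(sharedAcrossComponents(new LanguageModelFactory())); // true

        // Passing a class: each component gets its own factory object.
        System.out.println(sharedAcrossComponents(LanguageModelFactory.class)); // false
    }
}
```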
Works on my machine :) But I'll let Staszek review this again so that we're
sure it's really this.
> Distributed Support for Search Result Clustering
> ------------------------------------------------
>
> Key: SOLR-2282
> URL: https://issues.apache.org/jira/browse/SOLR-2282
> Project: Solr
> Issue Type: New Feature
> Components: contrib - Clustering
> Affects Versions: 1.4, 1.4.1
> Reporter: Koji Sekiguchi
> Assignee: Koji Sekiguchi
> Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: SOLR-2282-diagnostics.patch, SOLR-2282.patch,
> SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch,
> SOLR-2282_test.patch
>
>
> Brad Giaccio contributed a patch for this in SOLR-769. I'd like to
> incorporate it.