[ 
https://issues.apache.org/jira/browse/SOLR-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982061#action_12982061
 ] 

Dawid Weiss commented on SOLR-2282:
-----------------------------------

I think I nailed it. I did a white-box inspection of the Carrot2 code and 
thought it impossible for a concurrency bug to creep in (in particular with a 
simple controller), but what we didn't take into account is that the Carrot2 
infrastructure itself allows a scenario in which a single object instance is 
bound to multiple components at runtime (and is then effectively shared in a 
multi-threaded context). The code that sets up this sharing happens to be in 
Solr's code base, not in Carrot2. The bug happens because of the following 
series of events:

1) The controller in Solr itself is initialized with a single instance of "new 
LuceneLanguageModelFactory()" -- this factory is then injected into all 
components at runtime.
2) The base class of LuceneLanguageModelFactory is DefaultLanguageModelFactory 
which has an object-local cache of stemmers and tokenizers. In Carrot2 3.4.2, 
factories are component-bound anyway, so a factory can reuse its resources. In 
the trunk version, this is no longer the case (factories simply create new 
objects as they are requested).
3) Because of the tokenizers/stemmers cache, the same tokenizers and stemmers 
can be used in parallel when two requests are made at the same time. I think 
this should be fairly repeatable on all computers, regardless of the number of 
cores or their speed; it's just a matter of time. Clustering takes relatively 
longer than tokenization, so two tokenizations overlapping (and corrupting 
internal data structures) is a rare event (and yet, as we could see, frequent 
enough to manifest itself during tests).

{noformat}
    // Customize the language model factory. The implementation we provide here
    // is included in the code base of Solr, so that it's possible to refactor
    // the Lucene APIs the factory relies on if needed.
    initAttributes.put("PreprocessingPipeline.languageModelFactory",
      new LuceneLanguageModelFactory());
    this.controller.init(initAttributes);
{noformat}
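
The sharing described in steps 1-3 can be illustrated with a minimal, 
self-contained sketch. All class names below are hypothetical stand-ins, not 
the actual Carrot2 or Solr classes; the only assumption carried over from the 
analysis above is that the factory keeps an instance-local cache of stateful 
objects.

```java
import java.util.HashMap;
import java.util.Map;

public class SharedFactorySketch {
    /** Hypothetical stand-in for a stateful tokenizer that is not thread-safe. */
    static class Tokenizer {
        // Mutable internal state that concurrent callers could corrupt.
        private final StringBuilder buffer = new StringBuilder();
    }

    /** Hypothetical stand-in for a factory with an instance-local cache. */
    static class CachingFactory {
        private final Map<String, Tokenizer> cache = new HashMap<>();

        Tokenizer getTokenizer(String language) {
            // Same factory instance + same language -> same Tokenizer object.
            return cache.computeIfAbsent(language, k -> new Tokenizer());
        }
    }

    /** Hypothetical processing component that receives an injected factory. */
    static class Component {
        private final CachingFactory factory;

        Component(CachingFactory factory) {
            this.factory = factory;
        }

        Tokenizer tokenizer() {
            return factory.getTokenizer("en");
        }
    }

    /** True if both components end up holding the very same tokenizer object. */
    static boolean sharesTokenizer(Component a, Component b) {
        return a.tokenizer() == b.tokenizer();
    }

    public static void main(String[] args) {
        // One factory instance injected into two components: state is shared.
        CachingFactory shared = new CachingFactory();
        System.out.println(
            sharesTokenizer(new Component(shared), new Component(shared))); // true

        // One factory per component: each gets its own tokenizer.
        System.out.println(
            sharesTokenizer(new Component(new CachingFactory()),
                            new Component(new CachingFactory()))); // false
    }
}
```

In the first case, two components running on different threads would operate 
on the same tokenizer at once, which is the overlap described in step 3.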

The fix for the problem would be to:

1) upgrade to trunk/future Carrot2 version (because of different memory 
management in factories),
2) pass a class instead of an instance in the initialization attributes. So 
this should do:

{noformat}
    // Customize the language model factory. The implementation we provide here
    // is included in the code base of Solr, so that it's possible to refactor
    // the Lucene APIs the factory relies on if needed.
    initAttributes.put("PreprocessingPipeline.languageModelFactory",
      LuceneLanguageModelFactory.class);
    this.controller.init(initAttributes);
{noformat}
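
Why passing a class rather than an instance helps can be sketched as follows. 
This is an illustration only: the resolve() method is a hypothetical binder, 
and it assumes (as the fix above implies, but which this sketch does not 
verify against Carrot2) that a class-valued attribute is instantiated 
separately for each component it is bound to.

```java
public class ClassVersusInstanceSketch {
    /** Hypothetical stand-in for a language model factory. */
    static class Factory {
    }

    /**
     * Hypothetical binder: a Class value is instantiated anew on every call,
     * any other value is passed through unchanged.
     */
    static Object resolve(Object value) throws ReflectiveOperationException {
        if (value instanceof Class<?>) {
            return ((Class<?>) value).getDeclaredConstructor().newInstance();
        }
        return value;
    }

    /** True if two components bound to the same attribute get the same object. */
    static boolean sameObjectForTwoComponents(Object attributeValue)
            throws ReflectiveOperationException {
        Object first = resolve(attributeValue);
        Object second = resolve(attributeValue);
        return first == second;
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        // An instance is shared by every component it is bound to.
        System.out.println(sameObjectForTwoComponents(new Factory())); // true

        // A class yields a separate object per component.
        System.out.println(sameObjectForTwoComponents(Factory.class)); // false
    }
}
```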

Works on my machine :) But I'll let Staszek review this again so that we're 
sure this is really the cause.



> Distributed Support for Search Result Clustering
> ------------------------------------------------
>
>                 Key: SOLR-2282
>                 URL: https://issues.apache.org/jira/browse/SOLR-2282
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - Clustering
>    Affects Versions: 1.4, 1.4.1
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2282-diagnostics.patch, SOLR-2282.patch, 
> SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, 
> SOLR-2282_test.patch
>
>
> Brad Giaccio contributed a patch for this in SOLR-769. I'd like to 
> incorporate it.
