[ https://issues.apache.org/jira/browse/SOLR-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982061#action_12982061 ]
Dawid Weiss commented on SOLR-2282:
-----------------------------------

I think I nailed it. I did whitebox-inspect the Carrot2 code and thought it impossible for a concurrency bug to creep in (in particular with a simple controller), but what we didn't take into account is that the Carrot2 infrastructure itself allows a scenario in which a single object instance is bound to multiple components at runtime (and is then effectively shared in a multithreaded context). The offending code happens to be in Solr's code base, not in Carrot2.

The bug happens because of the following series of events:

1) The controller in Solr itself is initialized with a single instance of "new LuceneLanguageModelFactory()". This factory is then injected into all components at runtime.

2) The base class of LuceneLanguageModelFactory is DefaultLanguageModelFactory, which has an object-local cache of stemmers and tokenizers. In Carrot2 3.4.2, factories are component-bound anyway, so a factory can reuse its resources. In the trunk version, this is no longer the case (factories simply create new objects as they are requested).

3) Because of the tokenizers/stemmers cache, the same tokenizer and stemmer instances can be used in parallel when two requests are made at the same time.

I think this should be fairly repeatable on all computers, regardless of the number of cores or their speed; it's just a matter of time. Clustering takes much longer than tokenization, so two tokenizations overlapping (and screwing up internal data structures) is a rare event (and yet, as we could see, frequent enough to manifest itself during tests).

{noformat}
// Customize the language model factory. The implementation we provide here
// is included in the code base of Solr, so that it's possible to refactor
// the Lucene APIs the factory relies on if needed.
initAttributes.put("PreprocessingPipeline.languageModelFactory", new LuceneLanguageModelFactory());
this.controller.init(initAttributes);
{noformat}

The fix for the problem would be to:

1) upgrade to the trunk/future Carrot2 version (because of different memory management in factories),

2) pass a class instead of an instance to the initialization parameters.

So this should do:

{noformat}
// Customize the language model factory. The implementation we provide here
// is included in the code base of Solr, so that it's possible to refactor
// the Lucene APIs the factory relies on if needed.
initAttributes.put("PreprocessingPipeline.languageModelFactory", LuceneLanguageModelFactory.class);
this.controller.init(initAttributes);
{noformat}

Works on my machine :) But I'll let Staszek review this again so that we're sure it's really this.

> Distributed Support for Search Result Clustering
> ------------------------------------------------
>
> Key: SOLR-2282
> URL: https://issues.apache.org/jira/browse/SOLR-2282
> Project: Solr
> Issue Type: New Feature
> Components: contrib - Clustering
> Affects Versions: 1.4, 1.4.1
> Reporter: Koji Sekiguchi
> Assignee: Koji Sekiguchi
> Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: SOLR-2282-diagnostics.patch, SOLR-2282.patch,
> SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch,
> SOLR-2282_test.patch
>
>
> Brad Giaccio contributed a patch for this in SOLR-769. I'd like to
> incorporate it.
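
For readers who want to see the difference between binding an instance and binding a class in isolation, here is a minimal sketch of the scenario described in the comment above. It does not use the Carrot2 or Solr APIs: CachingFactory, Tokenizer and Component are hypothetical stand-ins, and the sketch assumes that a class-valued attribute leads to one fresh factory per component, which is what the proposed fix implies.

```java
public class FactoryBindingSketch {
    /** Hypothetical stand-in for a stateful tokenizer that is not thread-safe. */
    static final class Tokenizer {
        final StringBuilder buffer = new StringBuilder(); // mutable internal state
    }

    /** Hypothetical stand-in for a factory with an object-local cache. */
    public static class CachingFactory {
        private Tokenizer cached;

        public Tokenizer getTokenizer() {
            if (cached == null) {
                cached = new Tokenizer();
            }
            return cached; // every caller of this factory gets the same tokenizer
        }
    }

    /** Hypothetical component that resolves its factory from an init attribute. */
    static final class Component {
        final CachingFactory factory;

        Component(Object attribute) {
            if (attribute instanceof Class<?>) {
                // Class binding (assumption): instantiate a private factory per component.
                try {
                    factory = (CachingFactory) ((Class<?>) attribute)
                            .getDeclaredConstructor().newInstance();
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException(e);
                }
            } else {
                // Instance binding: every component receives the very same object.
                factory = (CachingFactory) attribute;
            }
        }
    }

    /** Returns true if two components built from the attribute share one tokenizer. */
    static boolean sharesTokenizer(Object attribute) {
        Component a = new Component(attribute);
        Component b = new Component(attribute);
        return a.factory.getTokenizer() == b.factory.getTokenizer();
    }

    public static void main(String[] args) {
        System.out.println("instance binding shares tokenizer: "
                + sharesTokenizer(new CachingFactory()));
        System.out.println("class binding shares tokenizer: "
                + sharesTokenizer(CachingFactory.class));
    }
}
```

With instance binding the two components end up holding the same cached tokenizer, so concurrent requests could touch its internal state at the same time; with class binding each component has its own. The sketch checks object identity only and deliberately does not try to reproduce the race itself.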