Further to this problem, I have created a custom tokenizer, but I cannot
get Solr to load it properly.
The error stacktrace:
----------------------------
Exception in thread "main" org.apache.solr.common.SolrException: SolrCore 'myproject' is not available due to init failure: Could not load conf for core myproject: Plugin init failure for [schema.xml] fieldType "myproject_text_2_sentences": Plugin init failure for [schema.xml] analyzer/tokenizer: Error instantiating class: 'my.lucene.tokenizer.WholeSentenceTokenizerFactory'. Schema file is D:\Work\myproject_github\myproject\solr-5.3.0\server\solr\myproject\conf\schema.xml
    at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:978)
    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:147)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
    at org.apache.solr.client.solrj.SolrClient.deleteByQuery(SolrClient.java:896)
    at org.apache.solr.client.solrj.SolrClient.deleteByQuery(SolrClient.java:859)
    at org.apache.solr.client.solrj.SolrClient.deleteByQuery(SolrClient.java:874)
    at my.app.Indexing.main(Indexing.java:31)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: org.apache.solr.common.SolrException: Could not load conf for core myproject: Plugin init failure for [schema.xml] fieldType "myproject_text_2_sentences": Plugin init failure for [schema.xml] analyzer/tokenizer: Error instantiating class: 'my.lucene.tokenizer.WholeSentenceTokenizerFactory'. Schema file is D:\Work\myproject_github\myproject\solr-5.3.0\server\solr\myproject\conf\schema.xml
    at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:80)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:725)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:447)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:438)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:210)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType "myproject_text_2_sentences": Plugin init failure for [schema.xml] analyzer/tokenizer: Error instantiating class: 'my.lucene.tokenizer.WholeSentenceTokenizerFactory'. Schema file is D:\Work\myproject_github\myproject\solr-5.3.0\server\solr\myproject\conf\schema.xml
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:596)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:175)
    at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
    at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
    at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:104)
    at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:75)
    ... 8 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType "myproject_text_2_sentences": Plugin init failure for [schema.xml] analyzer/tokenizer: Error instantiating class: 'my.lucene.tokenizer.WholeSentenceTokenizerFactory'
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:178)
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:489)
    ... 13 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/tokenizer: Error instantiating class: 'my.lucene.tokenizer.WholeSentenceTokenizerFactory'
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:178)
    at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:361)
    at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:104)
    at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:52)
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:152)
    ... 14 more
Caused by: org.apache.solr.common.SolrException: Error instantiating class: 'my.lucene.tokenizer.WholeSentenceTokenizerFactory'
    at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:578)
    at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:341)
    at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:334)
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:152)
    ... 18 more
Caused by: java.lang.NoSuchMethodException: my.lucene.tokenizer.WholeSentenceTokenizerFactory.<init>(java.util.Map)
    at java.lang.Class.getConstructor0(Class.java:3074)
    at java.lang.Class.getConstructor(Class.java:1817)
    at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:569)
    ... 21 more
-----------------------------------
'WholeSentenceTokenizerFactory' looks like:
---------------------
package my.lucene.tokenizer;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeFactory;

import java.text.BreakIterator;
import java.util.Map;

public class WholeSentenceTokenizerFactory extends TokenizerFactory {

    /**
     * Initialize this factory via a set of key-value pairs.
     *
     * @param args
     */
    protected WholeSentenceTokenizerFactory(Map<String, String> args) {
        super(args);
    }

    @Override
    public Tokenizer create(AttributeFactory factory) {
        return new WholeSentenceTokenizer(factory, BreakIterator.getSentenceInstance());
    }
}
-------------------------------------------------
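The last "Caused by" in the trace above is a NoSuchMethodException raised from Class.getConstructor, and that JDK method only returns public constructors. The following is a minimal, JDK-only sketch of that behaviour; the class names ProtectedCtorFactory and PublicCtorFactory are hypothetical stand-ins, not Solr or Lucene classes. It suggests, though nothing in this thread confirms it, that declaring the factory constructor public would be the first thing to try.

```java
import java.util.Map;

class ConstructorLookupDemo {

    // Hypothetical stand-in for a factory whose (Map) constructor is protected.
    static class ProtectedCtorFactory {
        protected ProtectedCtorFactory(Map<String, String> args) { }
    }

    // Hypothetical stand-in for a factory whose (Map) constructor is public.
    static class PublicCtorFactory {
        public PublicCtorFactory(Map<String, String> args) { }
    }

    // Returns true only if clazz declares a *public* constructor taking a Map,
    // because Class.getConstructor ignores non-public constructors.
    static boolean hasPublicMapConstructor(Class<?> clazz) {
        try {
            clazz.getConstructor(Map.class);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasPublicMapConstructor(ProtectedCtorFactory.class)); // prints "false"
        System.out.println(hasPublicMapConstructor(PublicCtorFactory.class));    // prints "true"
    }
}
```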
'WholeSentenceTokenizer':
-------------------------------------------------
package my.lucene.tokenizer;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.util.SegmentingTokenizerBase;
import org.apache.lucene.util.AttributeFactory;

import java.text.BreakIterator;

public class WholeSentenceTokenizer extends SegmentingTokenizerBase {

    // Bounds of the current sentence within the base class's buffer.
    protected int sentenceStart, sentenceEnd;
    // True while the current sentence has not yet been emitted as a token.
    protected boolean hasSentence;

    private CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

    public WholeSentenceTokenizer() {
        super(BreakIterator.getSentenceInstance());
    }

    public WholeSentenceTokenizer(BreakIterator iterator) {
        super(iterator);
    }

    public WholeSentenceTokenizer(AttributeFactory factory, BreakIterator iterator) {
        super(factory, iterator);
    }

    @Override
    protected void setNextSentence(int sentenceStart, int sentenceEnd) {
        this.sentenceStart = sentenceStart;
        this.sentenceEnd = sentenceEnd;
        hasSentence = true;
    }

    // Emits each sentence exactly once, as a single token.
    @Override
    protected boolean incrementWord() {
        if (hasSentence) {
            hasSentence = false;
            clearAttributes();
            termAtt.copyBuffer(buffer, sentenceStart, sentenceEnd - sentenceStart);
            offsetAtt.setOffset(correctOffset(offset + sentenceStart),
                    correctOffset(offset + sentenceEnd));
            return true;
        } else {
            return false;
        }
    }
}
-------------------------------------
Both classes are compiled into a jar, which is placed inside:
/myproject_github/myproject/solr-5.3.0/contrib/myproject
And solrconfig.xml points to the jar by defining a "lib" as:
<lib dir="${solr.install.dir:../../..}/contrib/myproject" regex=".*\.jar" />
Any suggestions as to what has gone wrong?
Many thanks
On 23/09/2015 19:08, Steve Rowe wrote:
Hi Ziqi,
Lucene has support for sentence chunking - see SegmentingTokenizerBase,
implemented in ThaiTokenizer and HMMChineseTokenizer. There is an example in
that class’s tests that creates tokens out of individual sentences:
TestSegmentingTokenizerBase.WholeSentenceTokenizer.
However, it sounds like you only need to store the sentences, not search
against them, so I don’t think you need sentence *tokenization*.
Why not simply use the JDK’s BreakIterator (or as you say OpenNLP) to do
sentence splitting and add to the doc as stored fields?
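
A minimal sketch of the BreakIterator approach suggested above, using only the JDK. The class name SentenceSplitSketch, the method splitSentences, and the choice of Locale.ENGLISH are assumptions made for illustration; adding each returned string to the document as a stored field is left to the indexing code.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitSketch {

    // Splits text into sentences with the JDK's sentence BreakIterator.
    // Locale.ENGLISH is an assumption; pick the locale that matches the data.
    static List<String> splitSentences(String text) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        iterator.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = iterator.first();
        // Walk consecutive boundaries; each [start, end) span is one sentence.
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Each sentence could then be added to the document as one stored field value.
        for (String sentence : splitSentences("Solr is a search server. Lucene is a library.")) {
            System.out.println(sentence);
        }
    }
}
```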
Steve
www.lucidworks.com
On Sep 23, 2015, at 11:39 AM, Ziqi Zhang <ziqi.zh...@sheffield.ac.uk> wrote:
Thanks that is understood.
My application is a bit special in the way that I need both an indexed field
with standard tokenization and an unindexed but stored field of sentences. Both
must be present for each document.
I could possibly make do with PatternTokenizer, but that is, of course, less
accurate than, e.g., wrapping the OpenNLP sentence splitter in a Lucene Tokenizer.
On 23/09/2015 16:23, Doug Turnbull wrote:
Sentence recognition is usually an NLP problem. Probably best handled
outside of Solr. For example, you probably want to train and run a sentence
recognition algorithm, inject a sentence delimiter, then use that delimiter
as the basis for tokenization.
More info on sentence recognition
http://opennlp.apache.org/documentation/manual/opennlp.html
On Wed, Sep 23, 2015 at 11:18 AM, Ziqi Zhang <ziqi.zh...@sheffield.ac.uk>
wrote:
Hi
I need a special kind of 'token' which is a sentence, so I need a
tokenizer that splits texts into sentences.
I wonder whether there are already such or similar implementations?
If I have to implement it myself, I suppose I need to implement a subclass
of Tokenizer. Having looked at a few existing implementations, it does not
look very straightforward how to do it. A few pointers would be highly
appreciated.
Many thanks
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
--
Ziqi Zhang
Research Associate
Department of Computer Science
University of Sheffield