[ https://issues.apache.org/jira/browse/PYLUCENE-32?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174781#comment-14174781 ]
Alex edited comment on PYLUCENE-32 at 11/4/14 2:29 PM:
-------------------------------------------------------

Resolved

was (Author: alexboy):
Thanks Andi. But I am using PyLucene version 3.6.2. I think the problem has to do with JVM instantiation, caused by a Java/Python array incompatibility, but I don't know how to solve it. Below are the Java files I added as classes to the Lucene core; perhaps you will have a better understanding of what the issue is.

The lemmatizer:

/*
 * Lemmatizing library for Lucene
 * Copyright (C) 2010 Lars Buitinck
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program. If not, see <http://www.gnu.org/licenses/>.
 */

package englishlemma;

import java.io.*;

import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.tagger.maxent.TaggerConfig;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

/**
 * An analyzer that uses an {@link EnglishLemmaTokenizer}.
 *
 * @author Lars Buitinck
 * @version 2010.1006
 */
public class EnglishLemmaAnalyzer extends Analyzer {
    private MaxentTagger posTagger;

    /**
     * Construct an analyzer with a tagger using the given model file.
     */
    public EnglishLemmaAnalyzer(String posModelFile) throws Exception {
        this(makeTagger(posModelFile));
    }

    /**
     * Construct an analyzer using the given tagger.
     */
    public EnglishLemmaAnalyzer(MaxentTagger tagger) {
        posTagger = tagger;
    }

    /**
     * Factory method for loading a POS tagger.
     */
    public static MaxentTagger makeTagger(String modelFile) throws Exception {
        TaggerConfig config = new TaggerConfig("-model", modelFile);
        // The final argument suppresses a "loading" message on stderr.
        return new MaxentTagger(modelFile, config, false);
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader input) {
        return new EnglishLemmaTokenizer(input, posTagger);
    }
}
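For context, this is roughly how I try to drive the analyzer from the Python side. It is only a sketch: it assumes the englishlemma classes were compiled into the Lucene core jar and picked up by the JCC wrapper (so that they appear in the lucene module), and the model path is a placeholder rather than my real one:

import lucene
lucene.initVM()  # this is the call that currently raises the error quoted below

# Placeholder path; the real Stanford tagger model lives elsewhere.
MODEL = "models/left3words-wsj-0-18.tagger"

# EnglishLemmaAnalyzer is assumed here to have been wrapped into the
# lucene module along with the rest of the core classes.
analyzer = lucene.EnglishLemmaAnalyzer(MODEL)

directory = lucene.SimpleFSDirectory(lucene.File("index"))
writer = lucene.IndexWriter(directory, analyzer, True,
                            lucene.IndexWriter.MaxFieldLength.UNLIMITED)

doc = lucene.Document()
doc.add(lucene.Field("contents", "The cats were running.",
                     lucene.Field.Store.YES, lucene.Field.Index.ANALYZED))
writer.addDocument(doc)
writer.close()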
The tokenizer for the lemmatizer:

/*
 * Lemmatizing library for Lucene
 * Copyright (c) 2010-2011 Lars Buitinck
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program. If not, see <http://www.gnu.org/licenses/>.
 */

package englishlemma;

import java.io.*;
import java.util.*;
import java.util.regex.*;

import com.google.common.collect.Iterables;

import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.process.Morphology;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

/**
 * A tokenizer that retrieves the lemmas (base forms) of English words.
 * Relies internally on the sentence splitter and tokenizer supplied with
 * the Stanford POS tagger.
 *
 * @author Lars Buitinck
 * @version 2011.0122
 */
public class EnglishLemmaTokenizer extends TokenStream {
    private Iterator<TaggedWord> tagged;
    private PositionIncrementAttribute posIncr;
    private TaggedWord currentWord;
    private TermAttribute termAtt;
    private boolean lemmaNext;

    /**
     * Construct a tokenizer processing the given input and a tagger
     * using the given model file.
     */
    public EnglishLemmaTokenizer(Reader input, String posModelFile)
            throws Exception {
        this(input, EnglishLemmaAnalyzer.makeTagger(posModelFile));
    }

    /**
     * Construct a tokenizer processing the given input using the given tagger.
     */
    public EnglishLemmaTokenizer(Reader input, MaxentTagger tagger) {
        super();

        lemmaNext = false;
        posIncr = addAttribute(PositionIncrementAttribute.class);
        termAtt = addAttribute(TermAttribute.class);

        List<List<HasWord>> tokenized = MaxentTagger.tokenizeText(input);
        tagged = Iterables.concat(tagger.process(tokenized)).iterator();
    }

    /**
     * Consumers use this method to advance the stream to the next token.
     * The token stream emits inflected forms and lemmas interleaved (form1,
     * lemma1, form2, lemma2, etc.), giving lemmas and their inflected forms
     * the same PositionAttribute.
     */
    @Override
    public final boolean incrementToken() throws IOException {
        if (lemmaNext) {
            // Emit a lemma
            posIncr.setPositionIncrement(1);
            String tag  = currentWord.tag();
            String form = currentWord.word();
            termAtt.setTermBuffer(Morphology.stemStatic(form, tag).word());
        } else {
            // Emit inflected form, if not filtered out.

            // 0 because the lemma will come in the same position
            int increment = 0;

            for (;;) {
                if (!tagged.hasNext())
                    return false;
                currentWord = tagged.next();
                if (!unwantedPOS(currentWord.tag()))
                    break;
                increment++;
            }
            posIncr.setPositionIncrement(increment);
            termAtt.setTermBuffer(currentWord.word());
        }

        lemmaNext = !lemmaNext;
        return true;
    }

    private static final Pattern unwantedPosRE = Pattern.compile(
        "^(CC|DT|[LR]RB|MD|POS|PRP|UH|WDT|WP|WP\\$|WRB|\\$|\\#|\\.|\\,|:)$"
    );

    /**
     * Determines if words with a given POS tag should be omitted from the
     * index. Defaults to filtering out punctuation and function words
     * (pronouns, prepositions, "the", "a", etc.).
     *
     * @see <a href="http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQP-HTMLDemo/PennTreebankTS.html">The Penn Treebank tag set</a> used by Stanford NLP
     */
    protected boolean unwantedPOS(String tag) {
        return unwantedPosRE.matcher(tag).matches();
    }
}

Note the dependencies: the tokenizer uses Google Guava (com.google.common.collect.Iterables) for its array/iterable handling, while the lemmatizer depends on the Stanford POS tagger.
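Since the traceback below shows the failure already inside lucene.initVM(), my guess is that these two dependencies never make it onto the JVM classpath. This is the kind of initialization I would expect to need; again only a sketch, with placeholder jar paths:

import os
import lucene

# Placeholder locations; the real jars live wherever Guava and the
# Stanford tagger were installed on this machine.
extra_jars = [
    r"C:\jars\guava.jar",
    r"C:\jars\stanford-postagger.jar",
]

# The classpath must be complete before the JVM starts; it cannot be
# extended after initVM() has been called.
classpath = os.pathsep.join([lucene.CLASSPATH] + extra_jars)
lucene.initVM(classpath=classpath)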
Thanks.

> pylucene CharArraySet jvm error
> -------------------------------
>
>                 Key: PYLUCENE-32
>                 URL: https://issues.apache.org/jira/browse/PYLUCENE-32
>             Project: PyLucene
>          Issue Type: Question
>         Environment: I added a customized Lucene analyzer class to the Lucene core in PyLucene. This class has Google Guava as a dependency because of the array handling functions available in com.google.common.collect.Iterables. When I tried to index using this analyzer, I got the following error:
> Traceback (most recent call last):
>   File "C:\IndexFiles.py", line 78, in
>     lucene.initVM()
> JavaError: java.lang.NoClassDefFoundError: org/apache/lucene/analysis/CharArraySet
> Java stacktrace:
> java.lang.NoClassDefFoundError: org/apache/lucene/analysis/CharArraySet
> Caused by: java.lang.ClassNotFoundException: org.apache.lucene.analysis.CharArraySet
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> Even the example indexing code from Lucene in Action, which I tried earlier and which worked, now returns the same error after I added this class. I am not too familiar with the CharArraySet class, but I can see the problem comes from it. How do I handle this? Thanks
>            Reporter: Alex
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)