[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784506#action_12784506 ]
DM Smith commented on LUCENE-2034:
----------------------------------

{quote}
bq. How about splitting out the stop words to their own class?

What do you mean by that? Can you elaborate?
{quote}

There are several parts to this.
* The analyzer needs to allow for user-supplied stop words, possibly null. Either these or the default list needs to be passed to the StopFilter.
* The stop word list needs to be loaded into a set. Currently it might be a Reader, a File or a String[] array.
* The WordListLoader is a helper class to construct the set from a File or Reader. StopawareAnalyzer has another helper for reading from a file for fa and ar. Otherwise there is duplicated code to stuff the array into a CharArraySet.

Most of the analyzers with stop words allow override with any of these and sometimes throw something else into the mix (such as non-UTF-8 encoded files). The code to handle these cases is somewhat repetitious.

My thought is for a class, say StopWords, that knows how to read stopwords.txt as a resource loaded by a specified class loader. Something like:
{code}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Set;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.util.Version;

public class StopWords {
  protected static final String DEFAULT_STOPFILE = "stopfile.txt";
  protected static final String DEFAULT_COMMENT = "#";

  private final Version matchVersion;
  private final boolean ignoreCase;
  private final String stopFile;
  private final String comment;
  private CharArraySet defaultStopWords;

  public StopWords(Version matchVersion, String stopFile, String comment, boolean ignoreCase) {
    this.matchVersion = matchVersion;
    this.ignoreCase = ignoreCase;
    this.stopFile = stopFile != null ? stopFile : DEFAULT_STOPFILE;
    this.comment = comment != null ? comment : DEFAULT_COMMENT;
  }

  public synchronized Set<?> getDefaultStopWords() throws IOException {
    // lazy loading
    if (defaultStopWords == null) {
      defaultStopWords = load();
    }
    return defaultStopWords;
  }

  protected CharArraySet load() throws IOException {
    // getClass() is the runtime class, so the file is resolved relative to the subclass
    final BufferedReader reader = new BufferedReader(
        new InputStreamReader(getClass().getResourceAsStream(stopFile), "UTF-8"));
    final CharArraySet result = new CharArraySet(matchVersion, 0, ignoreCase);
    try {
      for (String word = reader.readLine(); word != null; word = reader.readLine()) {
        if (!word.startsWith(comment)) {
          result.add(word.trim());
        }
      }
      return CharArraySet.unmodifiableSet(result);
    } finally {
      reader.close();
    }
  }
}
{code}

getClass() resolves to the class of the actual object and not the class in which it is called, so each subclass picks up the stop file in its own package.

Then in o.a.l.analysis.ar have:
{code}
public class ArabicStopWords extends StopWords {
  public ArabicStopWords(Version matchVersion) {
    super(matchVersion, null, null, false);
  }
}
{code}

Note that the arguments to super depend on the nature of the provided stop word list.

Additional code could be added to StopWords to handle the resource as a Reader and as a String[], but if we follow Robert's suggestion to externalize the list in a file, it is not needed.

> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-2034
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2034
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9
>            Reporter: Simon Willnauer
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch,
> LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch,
> LUCENE-2034.txt
>
>
> Due to the various tokenStream APIs we had in Lucene, analyzer subclasses
> need to implement at least one of the methods returning a tokenStream.
> When you look at the code it appears to be almost identical if both are
> implemented in the same analyzer. Each analyzer defines the same inner class
> (SavedStreams), which is unnecessary.
> In contrib almost every analyzer uses stopwords and each of them creates its
> own way of loading them or defines a large number of ctors to load stopwords
> from a file, set, arrays etc. Those ctors should be deprecated and
> eventually removed.
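
On the question raised in the comment of which class the stop file would be resolved against, here is a minimal JDK-only sketch. It is not the proposed Lucene code: a plain Set stands in for CharArraySet, and StopWordsSketch, Base, ArabicLike and parse are hypothetical names chosen for illustration. It shows that getClass() reports the runtime class of the object, and how a word list with comment lines might be parsed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

public class StopWordsSketch {

    /** Hypothetical base class standing in for the StopWords idea. */
    static class Base {
        // getClass() returns the runtime class, so an inherited method
        // called on a subclass instance sees the subclass here.
        String resourceOwner() {
            return getClass().getSimpleName();
        }
    }

    /** Hypothetical subclass, analogous to a per-language stop word class. */
    static class ArabicLike extends Base { }

    /** Reads one word per line, skipping blank lines and comment lines. */
    static Set<String> parse(Reader in, String comment) {
        Set<String> words = new LinkedHashSet<String>();
        try (BufferedReader reader = new BufferedReader(in)) {
            for (String line = reader.readLine(); line != null; line = reader.readLine()) {
                String word = line.trim();
                if (!word.isEmpty() && !word.startsWith(comment)) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Collections.unmodifiableSet(words);
    }

    public static void main(String[] args) {
        // Prints the subclass name, not "Base".
        System.out.println(new ArabicLike().resourceOwner());
        // Prints the parsed words in file order.
        System.out.println(parse(new StringReader("# header\nthe\n\n and \n"), "#"));
    }
}
```

Because Class.getResourceAsStream resolves a relative name against the package of the class it is called on, the same behaviour would let each subclass ship its own stop file next to its class file. Unlike the snippet in the comment, this sketch trims each line before the comment check and skips blank lines; that is a design choice of the sketch, not something the comment specifies.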