[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors

DM Smith (JIRA) Tue, 01 Dec 2009 15:03:44 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784506#action_12784506
 ]


DM Smith commented on LUCENE-2034:
----------------------------------

{quote}
bq.    How about splitting out the stop words to their own class? 

What do you mean by that? can you elaborate?
{quote}

There are several parts of this.
* The analyzer needs to allow for user-supplied stop words, possibly null. 
Either these or the default list needs to be supplied to the StopFilter.
* The stop word list needs to be loaded into a set. Currently it might be a 
Reader, a File or a String[] array.
* The WordListLoader is a helper class to construct the set from a File or 
Reader. StopawareAnalyzer has another helper for reading from a file for fa 
and ar. Otherwise there is duplicated code to stuff the array into a 
CharArraySet.
 
Most of the analyzers with stop words allow the defaults to be overridden with 
any of these, and sometimes throw something else into the mix (such as 
non-UTF-8 encoded files).

The code to handle these cases is somewhat repetitious.
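To make the repetition concrete, here is a rough sketch of the two steps that keep getting re-implemented: turning a String[] into a set, and picking between the user's list and the default. StopSetChooser and its methods are hypothetical names for illustration, it uses plain java.util sets instead of CharArraySet to stay self-contained, and it assumes that a null user list means "use the default", which the analyzers may not all agree on.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
final class StopSetChooser {
  private StopSetChooser() {}

  /** Copies an array of stop words into an unmodifiable set; null yields an empty set. */
  static Set<String> fromArray(String[] words) {
    if (words == null) {
      return Collections.emptySet();
    }
    return Collections.unmodifiableSet(new HashSet<String>(Arrays.asList(words)));
  }

  /**
   * Returns the user-supplied set when present, otherwise the defaults.
   * Assumption: null means "fall back to the default list".
   */
  static Set<String> choose(Set<String> userSupplied, Set<String> defaults) {
    return userSupplied != null ? userSupplied : defaults;
  }
}
```

Each analyzer ctor would then reduce to something like choose(userStopWords, defaults) before handing the set to the StopFilter.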

My thought is for a class, say StopWords, that knows how to read stopwords.txt 
as a resource loaded by a specified class loader. Something like:
{code}
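import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Set;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.util.Version;

public class StopWords {

  protected static final String DEFAULT_STOPFILE = "stopwords.txt";
  protected static final String DEFAULT_COMMENT  = "#";

  private final Version matchVersion;
  private final String  stopFile;
  private final String  comment;
  private final boolean ignoreCase;
  private CharArraySet  defaultStopWords;

  public StopWords(Version matchVersion, String stopFile, String comment, boolean ignoreCase) {
    this.matchVersion = matchVersion;
    this.ignoreCase   = ignoreCase;
    this.stopFile     = stopFile != null ? stopFile : DEFAULT_STOPFILE;
    this.comment      = comment  != null ? comment  : DEFAULT_COMMENT;
  }

  public synchronized Set<?> getDefaultStopWords() throws IOException {
    // lazy loading: the resource is read on first use only
    if (defaultStopWords == null) {
      defaultStopWords = load();
    }
    return defaultStopWords;
  }

  protected CharArraySet load() throws IOException {
    // getClass() is the runtime class, so a relative resource name is
    // resolved against the package of the concrete subclass
    final InputStream in = getClass().getResourceAsStream(stopFile);
    if (in == null) {
      throw new IOException("Stop word resource not found: " + stopFile);
    }
    final BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
    final CharArraySet result = new CharArraySet(matchVersion, 0, ignoreCase);
    try {
      for (String word = reader.readLine(); word != null; word = reader.readLine()) {
        // skip comment lines; every other line is one stop word
        if (!word.startsWith(comment)) {
          result.add(word.trim());
        }
      }
      return CharArraySet.unmodifiableSet(result);
    } finally {
      reader.close();
    }
  }

}
{code}
I'm pretty sure that getClass() resolves to the class of the actual object and 
not the class in which the call is written, so each subclass would pick up the 
stop word file sitting next to it in its own package.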
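The runtime-class behaviour is easy to check with getClass(), the standard java.lang.Object method. The sketch below uses two hypothetical classes (BaseWords and ArabicWords, not Lucene code) and records what getClass() reports both from an inherited method and from inside the base-class constructor.

```java
// Hypothetical classes for illustration only.
class BaseWords {
  final String seenInCtor;

  BaseWords() {
    // Recorded while the base-class constructor is still running.
    seenInCtor = getClass().getSimpleName();
  }

  /** Inherited method: reports the class of the actual object. */
  String owner() {
    return getClass().getSimpleName();
  }
}

class ArabicWords extends BaseWords {}
```

For new ArabicWords(), both owner() and seenInCtor yield "ArabicWords": getClass() reports the runtime class whether it is called from an inherited method or from the base-class constructor.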

Then in o.a.l.analysis.ar have:
{code}
public class ArabicStopWords extends StopWords {
  public ArabicStopWords(Version matchVersion) {
    super(matchVersion, null, null, false);
  }
}
{code}
Note that the arguments to super depend on the nature of the provided stop word 
list.

Additional code could be added to StopWords to handle resource as a Reader and 
as String[], but if we follow Robert's suggestion to externalize the list in a 
file it is not needed.
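If a Reader overload did turn out to be needed, it could look roughly like the sketch below. StopWordReaderSketch and fromReader are hypothetical names, and a plain java.util set stands in for CharArraySet so the example has no Lucene dependency. Unlike the StopWords code above, it also skips blank lines, which is an assumption on my part.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
final class StopWordReaderSketch {
  private StopWordReaderSketch() {}

  /** Reads one stop word per line, skipping blank lines and lines starting with the comment prefix. */
  static Set<String> fromReader(Reader source, String comment) throws IOException {
    final BufferedReader reader = new BufferedReader(source);
    final Set<String> words = new LinkedHashSet<String>();
    try {
      for (String line = reader.readLine(); line != null; line = reader.readLine()) {
        final String word = line.trim();
        if (word.length() > 0 && !word.startsWith(comment)) {
          words.add(word);
        }
      }
    } finally {
      // The helper owns the reader once it is passed in.
      reader.close();
    }
    return Collections.unmodifiableSet(words);
  }
}
```

The caller stays responsible for the encoding, since it constructs the Reader; that keeps the non-UTF-8 cases out of the helper.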

> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-2034
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2034
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9
>            Reporter: Simon Willnauer
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch, 
> LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, 
> LUCENE-2034.txt
>
>
> Due to the various tokenStream APIs we had in Lucene, analyzer subclasses 
> need to implement at least one of the methods returning a tokenStream. When 
> you look at the code it appears to be almost identical if both are 
> implemented in the same analyzer.  Each analyzer defines the same inner class 
> (SavedStreams), which is unnecessary.
> In contrib almost every analyzer uses stopwords and each of them creates its 
> own way of loading them or defines a large number of ctors to load stopwords 
> from a file, set, arrays etc. Those ctors should be deprecated and 
> eventually removed.
