[jira] [Created] (LUCENE-4641) Fix analyzer bugs documented in TestRandomChains

Robert Muir (JIRA) Thu, 20 Dec 2012 06:29:17 -0800

Robert Muir created LUCENE-4641:
-----------------------------------

             Summary: Fix analyzer bugs documented in TestRandomChains
                 Key: LUCENE-4641
                 URL: https://issues.apache.org/jira/browse/LUCENE-4641
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/analysis
            Reporter: Robert Muir



TestRandomChains.java found a lot of bugs, some of which are hard to fix. So we 
blacklisted certain analysis components from the test.

But we really need to fix these, some of these bugs are bad, and they impact 
users with e.g. highlighting (SOLR-4137 and so on):
{noformat}
  // TODO: fix those and remove
  private static final Set<Class<?>> brokenComponents = 
Collections.newSetFromMap(new IdentityHashMap<Class<?>,Boolean>());
  static {
    // TODO: can we promote some of these to be only
    // offsets offenders?
    Collections.<Class<?>>addAll(brokenComponents,
      // TODO: fix basetokenstreamtestcase not to trip because this one has no 
CharTermAtt
      EmptyTokenizer.class,
      // doesn't actual reset itself!
      CachingTokenFilter.class,
      // doesn't consume whole stream!
      LimitTokenCountFilter.class,
      // Not broken: we forcefully add this, so we shouldn't
      // also randomly pick it:
      ValidatingTokenFilter.class,
      // NOTE: these by themselves won't cause any 'basic assertions' to fail.
      // but see https://issues.apache.org/jira/browse/LUCENE-3920, if any 
      // tokenfilter that combines words (e.g. shingles) comes after them,
      // this will create bogus offsets because their 'offsets go backwards',
      // causing shingle or whatever to make a single token with a 
      // startOffset thats > its endOffset
      // (see LUCENE-3738 for a list of other offenders here)
      // broken!
      NGramTokenizer.class,
      // broken!
      NGramTokenFilter.class,
      // broken!
      EdgeNGramTokenizer.class,
      // broken!
      EdgeNGramTokenFilter.class,
      // broken!
      WordDelimiterFilter.class,
      // broken!
      TrimFilter.class
    );
  }

  // TODO: also fix these and remove (maybe):
  // Classes that don't produce consistent graph offsets:
  private static final Set<Class<?>> brokenOffsetsComponents = 
Collections.newSetFromMap(new IdentityHashMap<Class<?>,Boolean>());
  static {
    Collections.<Class<?>>addAll(brokenOffsetsComponents,
      ReversePathHierarchyTokenizer.class,
      PathHierarchyTokenizer.class,
      HyphenationCompoundWordTokenFilter.class,
      DictionaryCompoundWordTokenFilter.class,
      // TODO: corrumpts graphs (offset consistency check):
      PositionFilter.class,
      // TODO: it seems to mess up offsets!?
      WikipediaTokenizer.class,
      // TODO: doesn't handle graph inputs
      ThaiWordFilter.class,
      // TODO: doesn't handle graph inputs
      CJKBigramFilter.class,
      // TODO: doesn't handle graph inputs (or even look at positionIncrement)
      HyphenatedWordsFilter.class,
      // LUCENE-4065: only if you pass 'false' to enablePositionIncrements!
      TypeTokenFilter.class,
      // TODO: doesn't handle graph inputs
      CommonGramsQueryFilter.class
    );
  }
{noformat}


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Created] (LUCENE-4641) Fix analyzer bugs documented in TestRandomChains

Reply via email to