Re: [jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.

Mark Lassau Wed, 03 Sep 2008 19:18:33 -0700

Grant Ingersoll (JIRA) wrote:

Of course, it's still a bit weird, b/c in your case the type value is going to 
be set to ACRONYM, when your example is clearly not one.  This suggests to me 
that the grammar needs to be revisited, but that can wait until 3.0 I believe.

Grant, not sure what you mean by "b/c in your case the type value is going to be set to ACRONYM, when your example is clearly not one."

Once we set replaceInvalidAcronym=true, then the type is set to HOST.

However, if you were to revisit the grammar, then I would be interested to get in on the discussion on the behaviour of <HOST>. For instance, if you have a document like "visit www.apache.org", you currently won't get a hit if you search for "apache". In an issue tracker like JIRA, we want to be able to search for "NullPointerException", and get a hit for the document "Application threw java.lang.NullPointerException".
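
To make the search behaviour I am asking for concrete, here is a minimal sketch in plain Java (no Lucene classes involved). It assumes one possible policy: keep the full <HOST> token and additionally index each dot-separated part. The names HostTokenExpander and expand are hypothetical and exist only for this illustration; this is not how StandardTokenizer behaves today.

```java
import java.util.ArrayList;
import java.util.List;

public class HostTokenExpander {

    /**
     * Returns the original token followed by its dot-separated parts.
     * Hypothetical helper: illustrates one possible indexing policy only.
     */
    public static List<String> expand(String hostToken) {
        List<String> terms = new ArrayList<>();
        terms.add(hostToken); // keep the full token so exact matches still work
        for (String part : hostToken.split("\\.")) {
            // skip empty pieces and avoid repeating a token that has no dots
            if (!part.isEmpty() && !part.equals(hostToken)) {
                terms.add(part);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(expand("www.apache.org"));
        System.out.println(expand("java.lang.NullPointerException"));
    }
}
```

With terms like these in the index, a search for "NullPointerException" could match the java.lang.NullPointerException document, at the cost of extra terms and possibly unwanted hits on parts such as "www" or "java".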

Also note that the current implementation has problems if the document doesn't contain expected whitespace.

e.g. "I like Apache.They rock"
will get tokenized to the following:
I              <ALPHANUM>
like           <ALPHANUM>
Apache.They    <HOST>
rock           <ALPHANUM>
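
The reason for the <HOST> result can be sketched with a regular expression. This assumes, as a simplification, that the HOST rule amounts to "two or more alphanumeric runs joined by single dots"; the real grammar is defined in the tokenizer's JFlex source and may differ in detail. HostRuleSketch and looksLikeHost are hypothetical names used only here.

```java
import java.util.regex.Pattern;

public class HostRuleSketch {

    // Assumed simplification of the HOST rule: alphanumeric runs joined by dots.
    private static final Pattern HOST_LIKE =
            Pattern.compile("[A-Za-z0-9]+(\\.[A-Za-z0-9]+)+");

    /** True if the whole token matches the simplified host-like pattern. */
    public static boolean looksLikeHost(String token) {
        return HOST_LIKE.matcher(token).matches();
    }

    public static void main(String[] args) {
        // Split on whitespace only, then classify each piece.
        for (String token : "I like Apache.They rock".split("\\s+")) {
            String type = looksLikeHost(token) ? "<HOST>" : "<ALPHANUM>";
            System.out.println(token + "\t" + type);
        }
    }
}
```

Under that assumption, "Apache.They" is indistinguishable from a two-part host name, which is why a missing space after the full stop changes the token type.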

I don't think there is a simple one-size-fits-all answer to how this should behave. It depends on the context of the app that is using Lucene. The best answer may be to make some of the behaviour configurable, or have a suite of specific analyzers?


Mark.

Most of the contributed Analyzers suffer from invalid recognition of acronyms.
------------------------------------------------------------------------------

                Key: LUCENE-1373
                URL: https://issues.apache.org/jira/browse/LUCENE-1373
            Project: Lucene - Java
         Issue Type: Bug
         Components: Analysis, contrib/analyzers
   Affects Versions: 2.3.2
           Reporter: Mark Lassau
           Priority: Minor

LUCENE-1068 describes a bug in StandardTokenizer whereby a string like 
"www.apache.org." would be incorrectly tokenized as an acronym (note the dot at 
the end).
Unfortunately, keeping the "backward compatibility" of a bug turns out to harm 
us.
StandardTokenizer has a couple of ways to indicate "fix this bug", but 
unfortunately the default behaviour is still to be buggy.
Most of the non-English analyzers provided in lucene-analyzers utilize the 
StandardTokenizer, and in v2.3.2 not one of these provides a way to get the 
non-buggy behaviour :(
I refer to:
* BrazilianAnalyzer
* CzechAnalyzer
* DutchAnalyzer
* FrenchAnalyzer
* GermanAnalyzer
* GreekAnalyzer
* ThaiAnalyzer


