Hi Chris,

Take a look at

http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200703.mbox/[EMAIL PROTECTED]

that may help you. It splits tokens that are returned by "StandardTokenizer"
and contain "," or "-".
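
Roughly, the idea looks like the sketch below. This is plain Python to show
the logic only, not actual Lucene.Net code: "Token" and "split_tokens" are
made-up names for illustration, and the offset handling is my assumption. In
Lucene.Net the same logic would sit in a TokenFilter placed after the
StandardTokenizer.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for an analyzer token: text plus character offsets."""
    text: str
    start: int
    end: int


# Characters to split on; "," and "-" follow the description above.
_SPLIT_CHARS = re.compile(r"[,-]")


def split_tokens(tokens: Iterable[Token]) -> Iterator[Token]:
    """Yield sub-tokens for every token containing "," or "-"; pass others through."""
    for tok in tokens:
        if not _SPLIT_CHARS.search(tok.text):
            yield tok
            continue
        pos = 0
        for part in _SPLIT_CHARS.split(tok.text):
            if part:  # skip empty pieces from leading, trailing, or doubled separators
                yield Token(part, tok.start + pos, tok.start + pos + len(part))
            pos += len(part) + 1  # +1 for the separator character


if __name__ == "__main__":
    # Pretend these came out of the upstream tokenizer.
    upstream = [Token("ABC-2007-5-22", 0, 13), Token("lucene", 14, 20)]
    for t in split_tokens(upstream):
        print(t)
```

With the sample input, the serial number comes out as the four tokens "ABC",
"2007", "5" and "22", and "lucene" passes through unchanged.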

DIGY

-----Original Message-----
From: Chris David [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, June 12, 2007 12:36 AM
To: [email protected]
Subject: Lower Case ANDs and Serial Numbers

Good Afternoon Everyone,

 

I have two issues with the StandardAnalyzer that I have been trying to
solve and keep getting stuck on:

 - Tokenize strings that the standard grammar treats as serial
numbers, e.g.  "ABC-2007-5-22" is being stored as the single token
"ABC-2007-5-22" instead of "ABC" "2007" "5" "22".

 - Get the analyzer to recognize mixed-case "and"s in a query as "AND"

 

On the first issue of tokenizing strings I have been looking at the
StandardAnalyzer.jj file located in the "\Lucene.Net\Analysis\Standard"
folder.  I see that this file holds the JavaCC grammar the analyzer uses
to parse tokens.  I am wondering how this file gets compiled into the C#
dll.  

 

My other question about this file is how I can use the StandardAnalyzer.jj
to solve my first issue. From looking at the file, it appears that the
"<NUM>" grammar rule is the rule that defines a serial number as a
single token.  If I remove this from the array that defines the grammar,
will the tokenizer split the strings the way I am looking for?  Any
other ideas would be greatly appreciated.

 

On the second issue I am trying to avoid string.replace'ing the user
input query.  Hopefully there is some method in the QueryAnalyzer to
enable mixed case "and"s.

 

If this helps I am using Lucene 1.9.1 on Visual Studio 2005 and
compiling for the .NET 2.0 Framework.

 

Thanks,

 

Christopher A. David

Software Engineer

Snapstream Media 

http://www.snapstream.com

http://www.couchville.com

 

