Hello,
Thanks for the reply! I've found that the problem is caused by the
commas that separate the words. If I change the commas to spaces or
semicolons, it works fine. Commas also work as long as the word
contains no digits. Maybe it has something to do with how numbers like
"10,000" are handled?
I also have a second, somewhat related question. If I index the text
"deskbar-abc", it is indexed as "deskbar" and "abc", but if I index
"deskbar-abc288" instead, it is treated as one word. Is there a way to
make this consistent, for example by always keeping the dash and never
splitting the word?
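For anyone hitting the same comma issue, here is a minimal sketch of the workaround described above: turn separator commas into spaces before the text reaches the analyzer, while leaving commas between two digits (as in "10,000") alone. The class name `CommaNormalizer` and its `Normalize` method are hypothetical helpers, not part of Lucene.Net, and the rule "a comma between two digits belongs to a number" is an assumption about the data.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Lucene.Net): replaces separator commas
// with spaces so that words containing digits are not glued to their
// neighbours by the tokenizer.
public static class CommaNormalizer
{
    // Matches a comma that is NOT preceded by a digit, or NOT followed by
    // a digit. A comma with digits on both sides ("10,000") is left alone.
    private static readonly Regex SeparatorComma =
        new Regex(@"(?<!\d),|,(?!\d)");

    public static string Normalize(string text)
    {
        // Each separator comma becomes a single space.
        return SeparatorComma.Replace(text, " ");
    }
}
```

The same normalization would have to be applied to the query string as well as to the indexed text, for example `CommaNormalizer.Normalize("hello,alison29,there")` before passing the result to the field or to the query parser. It does not address the dash question.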
Many thanks in advance!
Min
DIGY wrote:
1.
I tried your case with the following code and everything worked as expected.
Test(new Lucene.Net.Analysis.Standard.StandardAnalyzer(), "hello alison20 there", "alison20");

void Test(Lucene.Net.Analysis.Analyzer analyzer, string stringToIndex, string stringToSearch)
{
    Lucene.Net.Store.RAMDirectory dir = new Lucene.Net.Store.RAMDirectory();
    Lucene.Net.Index.IndexWriter writer = new Lucene.Net.Index.IndexWriter(dir, analyzer);
    Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
    Lucene.Net.Documents.Field field = new Lucene.Net.Documents.Field("field1", stringToIndex,
        Lucene.Net.Documents.Field.Store.YES,
        Lucene.Net.Documents.Field.Index.TOKENIZED);
    doc.Add(field);
    writer.AddDocument(doc);
    writer.Close();
    Lucene.Net.Search.IndexSearcher searcher = new Lucene.Net.Search.IndexSearcher(dir);
    Lucene.Net.QueryParsers.QueryParser qp = new Lucene.Net.QueryParsers.QueryParser("field1", analyzer);
    Lucene.Net.Search.Query q = qp.Parse(stringToSearch);
    Lucene.Net.Search.Hits hits = searcher.Search(q);
    Console.WriteLine(hits.Length().ToString() + " hit(s)");
}
2.
Using StandardAnalyzer, the tokens of the string "hello alison20 there" are
"hello" and "alison20", as expected ("there" is dropped because it is one of
the analyzer's default English stop words).
TokenizeString(new Lucene.Net.Analysis.Standard.StandardAnalyzer(), "hello alison20 there");

void TokenizeString(Lucene.Net.Analysis.Analyzer analyzer, string s)
{
    Lucene.Net.Analysis.TokenStream ts = analyzer.TokenStream("", new System.IO.StringReader(s));
    for (Lucene.Net.Analysis.Token t = ts.Next(); t != null; t = ts.Next())
    {
        Console.WriteLine(t.TermText() + " " + t.Type());
    }
}
DIGY
-----Original Message-----
From: yin [mailto:[EMAIL PROTECTED]
Sent: Saturday, January 05, 2008 2:43 AM
To: [email protected]
Subject: Strange Indexing Problem with letter-number combination
Hello there!
I'm seeing a very strange indexing problem that I hope someone can shed some
light on.
I'm using a StandardAnalyzer (the default one, no special configuration). It
works great until it hits a file that contains a word combining letters and
digits, such as "alison29". I checked the index with Luke, and here's the
strange thing:
For the text "how are you", I get three index entries: "how", "are", and
"you". For the text "hello alison29 there", however, I get only one index
entry, "hello,alison29,there". As a consequence, searches for "alison29",
"hello", or "there" return nothing; a search only works if I look for
exactly "hello,alison29,there".
I could pad both my indexed text and my search keywords, but I'm not very
comfortable with that. The issue also seems too obvious to be an overlooked
bug, so more likely I've missed something, perhaps a parameter setting in
Lucene's StandardAnalyzer? Any ideas? Thank you very much for your help!
Regards,
Min