Re: Strange Indexing Problem with letter-number combination

Min Yin Tue, 08 Jan 2008 16:38:27 -0800

Hello,

Thanks for the reply! I've found that the problem is caused by the commas that separate the words. If I change the commas to spaces or semicolons, it works fine. Commas also work as long as none of the words contains a digit. Maybe it has something to do with how numbers such as "10,000" are handled?
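The workaround above could be applied to the text before it is handed to the analyzer. What follows is only a minimal sketch, not anything from Lucene.Net: the class name CommaNormalizer and its Normalize method are made up for illustration. It replaces a comma with a space unless the comma sits directly between two digits, so that a number like "10,000" is left alone.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Lucene.Net): turns separator commas into
// spaces before indexing, while leaving commas between two digits untouched.
public static class CommaNormalizer
{
    // Matches a comma that is not preceded by a digit, or not followed by one.
    // A comma with a digit on both sides (as in "10,000") matches neither branch.
    private static readonly Regex SeparatorComma =
        new Regex(@"(?<!\d),|,(?!\d)", RegexOptions.Compiled);

    public static string Normalize(string text)
    {
        if (text == null) throw new ArgumentNullException("text");
        return SeparatorComma.Replace(text, " ");
    }
}
```

For example, CommaNormalizer.Normalize("hello,alison20,there") returns "hello alison20 there". One limitation of this sketch: a comma between two words that happen to end and start with digits (such as "abc1,2def") is also left in place.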

I also have a second, somewhat related question. If I index the text "deskbar-abc", it is indexed as "deskbar" and "abc", but if I index "deskbar-abc288" instead, it is treated as one word. Is there a way to make this behave consistently? For example, always keep the dash and never split the word?


Many thanks in advance!
Min

DIGY wrote:

1.
I tried your case with the following code and everything worked as expected.

        Test(new Lucene.Net.Analysis.Standard.StandardAnalyzer(), "hello alison20 there", "alison20");

        void Test(Lucene.Net.Analysis.Analyzer analyzer, string stringToIndex, string stringToSearch)
        {
            Lucene.Net.Store.RAMDirectory dir = new Lucene.Net.Store.RAMDirectory();
            Lucene.Net.Index.IndexWriter writer = new Lucene.Net.Index.IndexWriter(dir, analyzer);
            Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
            Lucene.Net.Documents.Field field = new Lucene.Net.Documents.Field("field1", stringToIndex,
                Lucene.Net.Documents.Field.Store.YES,
                Lucene.Net.Documents.Field.Index.TOKENIZED);
            doc.Add(field);
            writer.AddDocument(doc);
            writer.Close();

            Lucene.Net.Search.IndexSearcher searcher = new Lucene.Net.Search.IndexSearcher(dir);
            Lucene.Net.QueryParsers.QueryParser qp = new Lucene.Net.QueryParsers.QueryParser("field1", analyzer);
            Lucene.Net.Search.Query q = qp.Parse(stringToSearch);
            Lucene.Net.Search.Hits hits = searcher.Search(q);
            Console.WriteLine(hits.Length().ToString() + " hit(s)");
        }


2.
Using StandardAnalyzer, the tokens of the string "hello alison20 there" are "hello"
and "alison20" (as expected; "there" is dropped as a stop word).

        TokenizeString(new Lucene.Net.Analysis.Standard.StandardAnalyzer(), "hello alison20 there");

        void TokenizeString(Lucene.Net.Analysis.Analyzer analyzer, string s)
        {
            Lucene.Net.Analysis.TokenStream ts = analyzer.TokenStream("", new System.IO.StringReader(s));
            for (Lucene.Net.Analysis.Token t = ts.Next(); t != null; t = ts.Next())
            {
                Console.WriteLine(t.TermText() + " " + t.Type());
            }
        }
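
The same helper can be pointed at the other inputs mentioned in this thread to see how each one is split. The sketch below assumes the Lucene.Net 2.x API used above; the class name TokenizerDemo is made up for illustration. It also runs WhitespaceAnalyzer, which splits on whitespace only and is shown here as one possible way to keep hyphenated words whole. That is an assumption about what might suit the use case, not a fix confirmed in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

// Hypothetical demo class; requires a reference to Lucene.Net 2.x.
public static class TokenizerDemo
{
    // Collects the term text of every token the analyzer emits for the input.
    public static List<string> Tokenize(Analyzer analyzer, string text)
    {
        List<string> terms = new List<string>();
        TokenStream ts = analyzer.TokenStream("", new StringReader(text));
        for (Token t = ts.Next(); t != null; t = ts.Next())
        {
            terms.Add(t.TermText());
        }
        return terms;
    }

    public static void Main()
    {
        string[] samples =
        {
            "hello,alison20,there",
            "hello alison20 there",
            "deskbar-abc",
            "deskbar-abc288"
        };
        Analyzer[] analyzers = { new StandardAnalyzer(), new WhitespaceAnalyzer() };

        // Print the tokens each analyzer produces for each sample string.
        foreach (Analyzer analyzer in analyzers)
        {
            Console.WriteLine(analyzer.GetType().Name);
            foreach (string sample in samples)
            {
                List<string> terms = Tokenize(analyzer, sample);
                Console.WriteLine("  \"" + sample + "\" -> " + string.Join(" | ", terms.ToArray()));
            }
        }
    }
}
```

Note the trade-off: WhitespaceAnalyzer does not lowercase terms or remove stop words, so the same analyzer would have to be used at both index time and query time, and searches become case-sensitive.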


DIGY

-----Original Message-----

From: yin [mailto:[EMAIL PROTECTED]
Sent: Saturday, January 05, 2008 2:43 AM

To: [email protected]
Subject: Strange Indexing Problem with letter-number combination

Hello there!

I see a very strange indexing problem that I hope someone can shed some light
on.

I am using a StandardAnalyzer (the default one, with no special configuration). It
works great until it hits a file that contains a letter-number combination
word such as "alison29". I checked the index with Luke, and here's the
strange thing:

For the text "how are you", I got three index entries: "how", "are", and
"you". For the text "hello alison29 there", however, I got only one index entry,
"hello,alison29,there". As a consequence, none of the searches for
"alison29", "hello", or "there" returns anything; it only works if I
search for exactly "hello,alison29,there".

I could pad both my indexed text and my search keywords, but I'm not very comfortable
with that. The issue also seems too obvious to be an overlooked bug; more likely
I missed something, perhaps some parameter setting in the Lucene
StandardAnalyzer? Any ideas? Thank you very much for your help!

Regards,

Min
