Hello,

> my dataset also seems to have a similar problem: the chemical name
> alpha-androstane-3, and several others, exist in the given text.
> Can anyone point out the best strategy to employ so that words
> containing - _ + are indexed as they are and not mutilated?

You have to use or write an Analyzer that doesn't tokenize on non-letter
or other characters.

> Currently on my indexes the StandardAnalyzer and QueryParser break up
> alpha-androstane-3 into TEXT:alpha -TEXT:androstane -TEXT:3, where TEXT
> is the Field to be searched.

Hm, I thought we'd fixed QueryParser not to do this. Are you using
Lucene 1.4?

Otis

> If I enclose alpha-androstane-3 as a phrase, "alpha-androstane-3", then
> the QueryParser breaks it down to ABSTRACT:"alpha androstane-3"; somehow
> the first "-" disappears?
>
> regards
>
> Rupinder
>
> >-----Original Message-----
> >From: Marcus Rau [mailto:[EMAIL PROTECTED]
> >Sent: 29 July 2004 11:48
> >To: [EMAIL PROTECTED]
> >Subject: Allow non letter characters in tokens
> >
> >Hi there,
> >
> >my question is a pretty short one!
> >
> >How can I prevent Lucene from cutting out special characters (i.e. the
> >"_") during tokenization of a text? It's quite essential for me to have
> >some non letter chars in my index.
> >
> >Regards
> >Marcus
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: [EMAIL PROTECTED]
> >For additional commands, e-mail: [EMAIL PROTECTED]
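
To illustrate the "Analyzer that doesn't tokenize on non-letter
characters" idea, here is a minimal sketch in plain Java with no Lucene
dependency. The class name KeepSymbolsTokenizer and the exact character
rule are assumptions made up for this example, not anything from the
Lucene code base. In Lucene itself the equivalent hook should be a
CharTokenizer subclass that overrides isTokenChar(char), wrapped in your
own Analyzer, but please check that against the version you run.

```java
import java.util.ArrayList;
import java.util.List;

public class KeepSymbolsTokenizer {

    // Assumed rule for this sketch: letters, digits and - _ + stay inside a token.
    static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || c == '-' || c == '_' || c == '+';
    }

    // Splits text on every character that is not a token character
    // and lower-cases the characters that are kept.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [levels, of, alpha-androstane-3, and, my_field+x, rose]
        System.out.println(tokenize("Levels of alpha-androstane-3 and my_field+x rose."));
    }
}
```

Whatever rule you pick, the same analyzer would have to be handed to
QueryParser as well, otherwise the query terms will not match what went
into the index.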
