Hi all,

My dataset also seems to have a similar problem: the chemical name alpha-androstane-3, and several others like it, exist in the given text. Can anyone point out the best strategy for indexing words containing - _ + exactly as they are, without them being mutilated?
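To make the desired behaviour concrete, here is a minimal sketch in plain Java. It deliberately does not use Lucene's API: the class name SymbolKeepingTokenizer and both methods are hypothetical, chosen only for illustration. It treats letters, digits and the characters - _ + as part of a token and splits on everything else. In Lucene the analogous hook would presumably be a CharTokenizer subclass that overrides isTokenChar, but that mapping is an assumption and is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT a Lucene class.
class SymbolKeepingTokenizer {

    // Assumed rule: letters, digits and the symbols - _ + belong to a token.
    static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || c == '-' || c == '_' || c == '+';
    }

    // Splits text into lower-cased tokens, breaking only on non-token characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A separator ends the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The chemical name should come out as a single token.
        System.out.println(tokenize("The sample contains alpha-androstane-3, among others."));
    }
}
```

Whatever rule is used at index time would also have to be applied to the query text, otherwise the indexed terms and the searched terms will not match.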
Currently, on my indexes, the StandardAnalyzer and QueryParser break up alpha-androstane-3 into TEXT:alpha -TEXT:androstane -TEXT:3, where TEXT is the field to be searched. If I enclose alpha-androstane-3 as a phrase, "alpha-androstane-3", then the QueryParser breaks it down to ABSTRACT:"alpha androstane-3"; somehow the first "-" disappears?

Regards,
Rupinder

>-----Original Message-----
>From: Marcus Rau [mailto:[EMAIL PROTECTED]
>Sent: 29 July 2004 11:48
>To: [EMAIL PROTECTED]
>Subject: Allow non letter characters in tokens
>
>
>Hi there,
>
>my question is a pretty short one!
>
>How can I prevent Lucene from cutting out special characters (i.e. the
>"_") during tokenization of a text? It's quite essential for me to have
>some non letter chars in my index.
>
>Regards
>Marcus
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [EMAIL PROTECTED]
>For additional commands, e-mail: [EMAIL PROTECTED]
