Hi Paul,

I agree with you: "Analyzer is the magic word". Let's look at it in more depth. I would consider three parts in the analyzer:

1. Tokenization (splitting the text into words)
2. Stopword removal (depends on the language)
3. Stemming of the words (depends on the language)

To start the analysis we first have to split the text. For example, I would like to split the text wherever I find one of the following non-alphabetic patterns:

"\s+|;|:|<|>|\^|~|=|--+|\+|\?|!|&|\$|@|\#|\'|`|"|_|\%|\*|,|\."

That is, I would like to split the text wherever I find a space, :, ;, ", ', <, >, ?, etc. After that the stopwords are removed, and then the stemming is applied.

I hope my question is clear now: how does Lucene split the text? Only when it encounters a space between the words, or does it consider the non-alphabetic characters as well? What is the whole grammar that StandardAnalyzer uses to split the words?

Madhu

Madhu Satyanarayana. Panitini
PASS GCA Solution Centre Pvt Ltd.
601 Aditya Trade Centre, Ameerpet,
Hyderabad, India.
www.pass-consulting.com

-----Original Message-----
From: Paul Libbrecht [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 13, 2005 3:40 PM
To: java-user@lucene.apache.org
Subject: Re: Spliting of words

Madhu,

Analyzer is the magic word here. Lucene's StandardAnalyzer has a whole grammar to split words into tokens. There are many more analyzers, most of which are language-specific (e.g. based on the Snowball or Porter stemmers; see contribs or the javadoc of core).
For which language do you wish to use it?

paul

On 13 Sept. 05, at 11:45, Madhu Satyanarayana Panitini wrote:

> Hi all,
>
> I want to know the split pattern of text before indexing in Lucene.
> Does it split wherever there is a space between the words, or is there
> some other pattern for splitting the words of a text document? In which
> program can I find the code for splitting the words?
>
> Madhu
>
> Madhu Satyanarayana. Panitini
> PASS GCA Solution Centre Pvt Ltd.
> 601 Aditya Trade Centre, Ameerpet,
> Hyderabad, India.
> www.pass-consulting.com
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
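
For illustration only, the three steps discussed in this thread (tokenization, stopword removal, stemming) could be sketched in plain Java roughly as follows. This is not Lucene's StandardAnalyzer and does not use the Lucene API: the class name AnalyzerSketch, the tiny stopword list, and the naive suffix-stripping "stemmer" are all assumptions made up for this example. Only the split pattern is taken from the message above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch of a tokenize -> stopword -> stem pipeline.
// It does NOT reproduce the grammar of Lucene's StandardAnalyzer.
final class AnalyzerSketch {

    // The split pattern quoted in the message, escaped for a Java string literal.
    private static final Pattern SPLIT = Pattern.compile(
        "\\s+|;|:|<|>|\\^|~|=|--+|\\+|\\?|!|&|\\$|@|\\#|\\'|`|\"|_|\\%|\\*|,|\\.");

    // Assumed, deliberately tiny English stopword list (illustration only).
    private static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("i", "the", "a", "an", "and", "or", "is", "to"));

    // Step 1: lowercase the text and split it on the pattern, dropping the
    // empty strings that appear between two adjacent delimiters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : SPLIT.split(text.toLowerCase(Locale.ROOT))) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Step 3 helper: a naive suffix stripper, standing in for a real
    // Porter or Snowball stemmer. It keeps at least three characters.
    static String stem(String token) {
        for (String suffix : new String[] {"ing", "es", "s"}) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Steps 1-3 combined: tokenize, drop stopwords, stem what is left.
    static List<String> analyze(String text) {
        List<String> result = new ArrayList<>();
        for (String token : tokenize(text)) {
            if (!STOPWORDS.contains(token)) {
                result.add(stem(token));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(
            analyze("I like indexing the text; Lucene splits words, tokens--and stems."));
        // prints [like, index, text, lucene, split, word, token, stem]
    }
}
```

Note that a single hyphen is not a delimiter in this pattern (only two or more in a row are), so a word such as "stop-word" would stay as one token. A real analyzer would also need a proper language-specific stopword list and stemmer instead of the placeholders used here.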