Hi All, I'm trying to index data to lucene index in unicode utf-8 format. All my search queries are of the form \uxxxx and its working fine. But the problem is in some cases, when the document[actually a webpage content] contains Numeric Character Reference[decimal], these are getting indexed as such. For example I've the following data[some telugu language data],
డాక్టర్ When I index this they get indexed as such and querying using \uxxxx format doesnot give any result. so I want to know is there any way where we can configure lucene to take care of such things by itself, or I've to convert the same to \uxxxx format[this is just replace &# with \u and replace the 4-dig number with its hex equivalent]. This manual method doesnot sound good to me. If there is any standard way to doing the same, please someone let ke know. Thank you. --KK. One question? Is it mandatory that the data to be indexed by lucene has to in \uxxxx format for unicode utf-8 encoded data?