How to post date encoded in NCR(decimal) to lucene indexer?

KK Mon, 01 Jun 2009 01:49:44 -0700

Hi All,
I'm trying to index data to lucene index in unicode utf-8 format. All my
search queries are of the form \uxxxx and its working fine. But the problem
is in some cases, when the document[actually a webpage content] contains
Numeric Character Reference[decimal], these are getting indexed as such. For
example I've the following data[some telugu language data],


&#3105;&#3134;&#3093;&#3149;&#3103;&#3120;&#3149;

When I index this they get indexed as such and querying using \uxxxx
format doesnot give any result. so I want to know is there any way
where we can configure lucene to take
care of such things by itself, or I've to convert the same to \uxxxx
format[this is just replace &# with \u and replace the 4-dig number
with its hex equivalent]. This manual

method doesnot sound good to me. If there is any standard way to doing
the same, please someone let ke know. Thank you.

--KK.

One question?
Is it mandatory that the data to be indexed by lucene has to in \uxxxx
format for unicode utf-8 encoded data?

How to post date encoded in NCR(decimal) to lucene indexer?

Reply via email to