CJKAnalyzer should convert half width katakana to full width katakana
---------------------------------------------------------------------
Key: LUCENE-1032
URL: https://issues.apache.org/jira/browse/LUCENE-1032
Project: Lucene - Java
Issue Type: Improvement
Reporter: Andrew Lynch
Some of our Japanese customers are reporting errors when performing searches
using half-width characters.
The desired behavior is that a document containing half-width characters should
be returned when searching with either the full-width equivalents or the
half-width characters themselves.
Currently, a search does not return any matches for half-width characters.
Here is a test case outlining the desired behavior (this may require a new
Analyzer).
{code}
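import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UnsupportedEncodingException;

import junit.framework.TestCase;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;

public class TestJapaneseEncodings extends TestCase
{
    // UTF-8 bytes for U+30AB (full-width katakana KA)
    byte[] fullWidthKa = new byte[]{(byte) 0xE3, (byte) 0x82, (byte) 0xAB};
    // UTF-8 bytes for U+FF76 (half-width katakana KA)
    byte[] halfWidthKa = new byte[]{(byte) 0xEF, (byte) 0xBD, (byte) 0xB6};

    // Half-width input should be analyzed into the full-width term.
    public void testAnalyzerWithHalfWidth() throws IOException
    {
        Reader r1 = new StringReader(makeHalfWidthKa());
        TokenStream stream = new CJKAnalyzer().tokenStream("foo", r1);
        assertNotNull(stream);
        Token token = stream.next();
        assertNotNull(token);
        assertEquals(makeFullWidthKa(), token.termText());
    }

    // Full-width input should pass through unchanged.
    public void testAnalyzerWithFullWidth() throws IOException
    {
        Reader r1 = new StringReader(makeFullWidthKa());
        TokenStream stream = new CJKAnalyzer().tokenStream("foo", r1);
        assertEquals(makeFullWidthKa(), stream.next().termText());
    }

    private String makeFullWidthKa() throws UnsupportedEncodingException
    {
        return new String(fullWidthKa, "UTF-8");
    }

    private String makeHalfWidthKa() throws UnsupportedEncodingException
    {
        return new String(halfWidthKa, "UTF-8");
    }
}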
{code}
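For illustration only, the mapping the test expects can be sketched with Unicode NFKC normalization, which folds half-width katakana into their full-width equivalents. This is an assumption about one possible approach, not a proposed patch: the class name WidthFolding and the method toFullWidth are hypothetical and not part of Lucene, and java.text.Normalizer requires Java 6 or later. Note that NFKC also folds other compatibility characters (for example full-width Latin letters into ASCII), which may be broader than what is wanted here.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Lucene: shows the character mapping only,
// not how it would be wired into CJKAnalyzer or a TokenFilter.
final class WidthFolding {
    private WidthFolding() {}

    /**
     * Applies Unicode NFKC normalization to the input. Half-width katakana
     * such as U+FF76 become their full-width forms (U+30AB); a half-width
     * base plus a half-width voiced sound mark is composed into a single
     * full-width character. Other compatibility characters are folded too.
     */
    static String toFullWidth(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }
}
```

With this sketch, WidthFolding.toFullWidth("\uFF76") returns "\u30AB", and text that is already full-width katakana is returned unchanged, so applying the same folding at index and query time would let both forms match.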