Jon,
Java expects your files to be in the encoding of the Native Locale. In most cases in
the U.S., this will be English. If you want to read files in that are in a different
encoding, you have to tell Java what your encoding is, in this case, Shift JIS. See
the javadocs for java.io.InputStreamReader.
Thus, you will most likely have to alter the Lucene demo to load your document in a
different way. If you look at HTMLDocument, my guess is you can replace the
construction of the HTMLParser with something like (somewhere around line 63):
HTMLParser parser = new HTMLParser(new InputStreamReader(new FileInputStream(file),
"SJIS")); //Not sure if "SJIS" is the right moniker for Shift-Jis.
I have not tested this, but I recall do this when I first started to be able to index
my UTF-8 docs.
I am not exactly sure if SJIS is the abbreviated form for Shift-JIS. It will throw a
UnsupportedEncodinException if it is not right. In that case, you can look up in the
Java documentation to see what is supported.
Btw, do this with your original files, which are in Shift-JIS.
>>> [EMAIL PROTECTED] 07/06/04 12:53PM >>>
Hi Jon,
It sounds to me like you have a character encoding problem. The
native2ascii tool is designed to produce input for the Java compiler;
the "\u7aef" notation you're seeing is understood by Java string
interpreters to mean the corresponding hexadecimal Unicode code point.
Other Java programs, however, depending on their implementation, may not
understand this notation. Alternatively, maybe the notation is
understood, but the conversion from Shift-JIS to Java Unicode format is
not being performed properly; if you don't tell native2ascii the source
encoding, it will assume the "native" encoding for the platform--on
Windows, depending on which localized version you've got, this is likely
to be the so-called code page 1252 (ISO-8859-1 with a few
modifications). Converting from one character encoding to another with
incorrect assumptions about the source encoding can only lead to sorrow
and confusion.
I think you can use the native2ascii tool to do what you want
(untested), but it will take two passes:
1. Use native2ascii to convert your file(s) to Java Unicode format, but
tell it the source encoding:
native2ascii -encoding SJIS inputfile outputfile1
2. Tell it to convert from Java Unicode format to UTF-8:
native2ascii -reverse -encoding UTF8 outputfile1 finaloutput
Here's a web page with more information on native2ascii:
<URL:http://java.sun.com/j2se/1.4.2/docs/tooldocs/windows/native2ascii.html>
Hope it helps,
Steve Rowe
Jon Schuster wrote:
> I've gone through all of the past messages regarding the CJKAnalyzer but I
> still must be doing something wrong because my searches don't work.
>
> I'm using the IndexHTML application from the org.apache.lucene.demo package
> to do the indexing, and I've changed the analyzer to use the CJKAnalyzer.
> I've also tried with and without setting the file.encoding to Shift-JIS.
> I've tried indexing the HTML files, which contain Shift-JIS, without
> conversion to Unicode and I get assorted "Parse Aborted: Lexical error..."
> messages. I've also tried converting the Shift-JIS HTML files to Unicode by
> first running them through the native2ascii tool.
>
> When the files are converted via native2ascii, they index without errors,
> but the index appears to contain the Unicode characters as literal strings
> such as "u7aef", "u7af6", etc. Searching for an English word produces
> results that have text like "code \u5c5e\u6027".
>
> Since others have gotten Japanese indexing to work, what's the secret I'm
> missing?
>
> Thanks,
> Jon
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]