FileWriter may garble non-ASCII output if the environment variable LANG/LC_ALL
is not appropriate.
--------------------------------------------------------------------------------------------------
Key: MAHOUT-594
URL: https://issues.apache.org/jira/browse/MAHOUT-594
Project: Mahout
Issue Type: Bug
Components: Utils
Affects Versions: 0.4
Environment: RHL Linux 2.6.18
Reporter: Shige Takeda
For non-ASCII output data, java.io.FileWriter should be replaced with
java.io.OutputStreamWriter in UTF-8.
For example, if you dump centroids of clusters using ClusterDumper, you may get
the following output:
{noformat}
...
C-0{n=2 c=[brown:2.099, c?t:1.957, dogs:1.916, fox:0.652, jumped:2.099,
l?zy:1.884, over:2.099, quick:2.099, red:1.916, ?:0.871, ?:0.871, ?:0.871,
?:0.871] r=[c?t:0.652, fox:0.652, l?zy:1.131, ?:0.871, ?:0.871, ?:0.871,
?:0.871]}
Top Terms:
quick => 2.0986123085021973
over => 2.0986123085021973
jumped => 2.0986123085021973
brown => 2.0986123085021973
c?t => 1.957078456878662
red => 1.9162907600402832
dogs => 1.9162907600402832
l?zy => 1.8843144178390503
? => 0.8706584572792053
? => 0.8706584572792053
Weight: Point:
1.0: P(0) = [brown:2.099, dogs:1.916, fox:2.609, jumped:2.099, over:2.099,
quick:2.099, red:1.916, ?:2.322, ?:2.322, ?:2.322, ?:2.322]
1.0: P(1) = [brown:2.099, dogs:1.916, fox:2.609, jumped:2.099, over:2.099,
quick:2.099, red:1.916, ?:2.322, ?:2.322, ?:2.322, ?:2.322]
1.0: P(2) = [brown:2.099, c?t:2.609, dogs:1.916, jumped:2.099, over:2.099,
quick:2.099, red:1.916, ?:2.322, ?:2.322, ?:2.322, ?:2.322]
...
{noformat}
where "?" characters were garbled by FileWriter. NOTE: this test case is a
tweaked version of TestClusterDumper. E.g., lazy => läzy
The cause of this is the line in ClusterDumper.java:
{code}
Writer writer = this.outputFile == null ? new OutputStreamWriter(System.out) :
new FileWriter(this.outputFile);
{code}
This can be around by setting the environment variables LC_ALL/LANG to
en_US.UTF-8, but many environments have LC_ALL/LANG=C by default, and in some
cases, you even may not have choices but C for various reasons.
To address this issue, I would like to propose to hard code the output encoding
to UTF-8 as follows:
{code}
Writer writer = this.outputFile == null ? new OutputStreamWriter(System.out) :
new OutputStreamWriter(new FileInputStream(this.outputFile), UTF8);
{code}
This way, the output file encoding will not be affected by environments.
And if this proposal is agreed, a similar fix should be applied to the
following files:
- ./core/src/main/java/org/apache/mahout/classifier/sgd/ModelSerializer.java
- ./core/src/test/java/org/apache/mahout/fpm/pfpgrowth/PFPGrowthTest.java
- ./examples/src/main/java/org/apache/mahout/classifier/sgd/TrainLogistic.java
-
./examples/src/main/java/org/apache/mahout/clustering/display/DisplaySpectralKMeans.java
- ./utils/src/main/java/org/apache/mahout/clustering/lda/LDAPrintTopics.java
- ./utils/src/main/java/org/apache/mahout/utils/SequenceFileDumper.java
- ./utils/src/main/java/org/apache/mahout/utils/clustering/ClusterDumper.java
- ./utils/src/main/java/org/apache/mahout/utils/vectors/VectorDumper.java
- ./utils/src/main/java/org/apache/mahout/utils/vectors/arff/Driver.java
-
./utils/src/main/java/org/apache/mahout/utils/vectors/lucene/ClusterLabels.java
- ./utils/src/main/java/org/apache/mahout/utils/vectors/lucene/Driver.java
Hope not many folks prefer ISO-8859-1 or other 'legacy' character sets.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.