[ 
https://issues.apache.org/jira/browse/MAHOUT-594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986409#action_12986409
 ] 

Shige Takeda commented on MAHOUT-594:
-------------------------------------

I had one for ClusterDumper. Let me try changing all of them.


> FileWriter may garble non-ASCII output if the environment variable 
> LANG/LC_ALL is not appropriate.
> --------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-594
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-594
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.4
>         Environment: RHL Linux 2.6.18
>            Reporter: Shige Takeda
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.5
>
>
> For non-ASCII output data, java.io.FileWriter should be replaced with 
> java.io.OutputStreamWriter in UTF-8.
> For example, if you dump centroids of clusters using ClusterDumper, you may 
> get the following output:
> {noformat}
> ...
> C-0{n=2 c=[brown:2.099, c?t:1.957, dogs:1.916, fox:0.652, jumped:2.099, 
> l?zy:1.884, over:2.099, quick:2.099, red:1.916, ?:0.871, ?:0.871, ?:0.871, 
> ?:0.871] r=[c?t:0.652, fox:0.652, l?zy:1.131, ?:0.871, ?:0.871, ?:0.871, 
> ?:0.871]}
>     Top Terms:
>         quick                                   =>  2.0986123085021973
>         over                                    =>  2.0986123085021973
>         jumped                                  =>  2.0986123085021973
>         brown                                   =>  2.0986123085021973
>         c?t                                     =>   1.957078456878662
>         red                                     =>  1.9162907600402832
>         dogs                                    =>  1.9162907600402832
>         l?zy                                    =>  1.8843144178390503
>         ?                                       =>  0.8706584572792053
>         ?                                       =>  0.8706584572792053
>     Weight:  Point:
>     1.0: P(0) = [brown:2.099, dogs:1.916, fox:2.609, jumped:2.099, 
> over:2.099, quick:2.099, red:1.916, ?:2.322, ?:2.322, ?:2.322, ?:2.322]
>     1.0: P(1) = [brown:2.099, dogs:1.916, fox:2.609, jumped:2.099, 
> over:2.099, quick:2.099, red:1.916, ?:2.322, ?:2.322, ?:2.322, ?:2.322]
>     1.0: P(2) = [brown:2.099, c?t:2.609, dogs:1.916, jumped:2.099, 
> over:2.099, quick:2.099, red:1.916, ?:2.322, ?:2.322, ?:2.322, ?:2.322]
> ...
> {noformat}
> where the "?" characters are non-ASCII characters garbled by FileWriter. 
> NOTE: this test case is a tweaked version of TestClusterDumper, e.g., 
> lazy => läzy.
> The cause is the following line in ClusterDumper.java, where FileWriter 
> encodes with the platform default character set:
> {code}
> Writer writer = this.outputFile == null ? new OutputStreamWriter(System.out) 
> : new FileWriter(this.outputFile);
> {code}
> This can be worked around by setting the environment variables LC_ALL/LANG to 
> en_US.UTF-8, but many environments have LC_ALL/LANG=C by default, and in some 
> cases you may have no choice but C for various reasons.
> To address this issue, I propose hard-coding the output 
> encoding to UTF-8 as follows:
> {code}
> Writer writer = this.outputFile == null ? new OutputStreamWriter(System.out) 
> : new OutputStreamWriter(new FileOutputStream(this.outputFile), "UTF-8");
> {code}
> This way, the output file encoding will not depend on the environment.
> If this proposal is accepted, a similar fix should be applied to the 
> following files:
> - ./core/src/main/java/org/apache/mahout/classifier/sgd/ModelSerializer.java
> - ./core/src/test/java/org/apache/mahout/fpm/pfpgrowth/PFPGrowthTest.java
> - ./examples/src/main/java/org/apache/mahout/classifier/sgd/TrainLogistic.java
> - ./examples/src/main/java/org/apache/mahout/clustering/display/DisplaySpectralKMeans.java
> - ./utils/src/main/java/org/apache/mahout/clustering/lda/LDAPrintTopics.java
> - ./utils/src/main/java/org/apache/mahout/utils/SequenceFileDumper.java
> - ./utils/src/main/java/org/apache/mahout/utils/clustering/ClusterDumper.java
> - ./utils/src/main/java/org/apache/mahout/utils/vectors/VectorDumper.java
> - ./utils/src/main/java/org/apache/mahout/utils/vectors/arff/Driver.java
> - ./utils/src/main/java/org/apache/mahout/utils/vectors/lucene/ClusterLabels.java
> - ./utils/src/main/java/org/apache/mahout/utils/vectors/lucene/Driver.java
> Hope not many folks prefer ISO-8859-1 or other 'legacy' character sets.
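
For anyone who wants to reproduce the effect without changing LANG/LC_ALL, here is a minimal sketch. It is not the actual patch: the class name Utf8WriterSketch and its helper methods are hypothetical, it uses java.nio.charset.StandardCharsets (Java 7 or later), and it assumes that a JVM running under LANG=C falls back to US-ASCII as its default encoding, which it simulates by encoding to US-ASCII explicitly.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class Utf8WriterSketch {

  // Hypothetical helper: opens a Writer that always encodes as UTF-8,
  // regardless of the platform default character set.
  static Writer openUtf8Writer(File outputFile) throws IOException {
    return new OutputStreamWriter(new FileOutputStream(outputFile), StandardCharsets.UTF_8);
  }

  // Simulates a default-charset writer under an ASCII-only locale (assumption):
  // characters that cannot be encoded are replaced with '?'.
  static String asciiRoundTrip(String text) {
    byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
    return new String(bytes, StandardCharsets.US_ASCII);
  }

  // Writes the text through the UTF-8 writer to a temp file and reads it back.
  static String utf8RoundTrip(String text) {
    try {
      File tmp = File.createTempFile("utf8-sketch", ".txt");
      try {
        try (Writer writer = openUtf8Writer(tmp)) {
          writer.write(text);
        }
        return new String(Files.readAllBytes(tmp.toPath()), StandardCharsets.UTF_8);
      } finally {
        tmp.delete();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    String term = "l\u00e4zy"; // the tweaked term from the test case
    System.out.println("ASCII round trip: " + asciiRoundTrip(term));
    System.out.println("UTF-8 round trip preserved: " + utf8RoundTrip(term).equals(term));
  }
}
```

The term is written with a Unicode escape so that the sketch itself does not depend on the encoding of the source file.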

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
