[ https://issues.apache.org/jira/browse/MAHOUT-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-594:
-----------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I committed a variant of your patch that does the same thing in a different 
way. Forgive me. I wanted to make some related changes around those lines 
anyway, so I found it easier to create my own rendition. There is no more use 
of FileWriter and FileReader, and that's good.
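
For readers following along, here is a minimal sketch of what replacing 
FileWriter and FileReader with explicit-charset streams can look like. It is 
not the committed change: the class name Utf8FileIo and its methods are 
hypothetical, and it assumes Java 7 or later for StandardCharsets and 
try-with-resources.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of Mahout.
final class Utf8FileIo {

  // Writes text as UTF-8 regardless of LANG/LC_ALL or file.encoding.
  static void write(File file, String text) throws IOException {
    try (Writer writer =
        new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8)) {
      writer.write(text);
    }
  }

  // Reads the whole file back, decoding it as UTF-8.
  static String read(File file) throws IOException {
    StringBuilder sb = new StringBuilder();
    try (Reader reader =
        new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)) {
      char[] buf = new char[4096];
      int n;
      while ((n = reader.read(buf)) != -1) {
        sb.append(buf, 0, n);
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) throws IOException {
    File tmp = File.createTempFile("utf8-demo", ".txt");
    tmp.deleteOnExit();
    String term = "l\u00e4zy:1.884"; // "läzy", escaped so the source file stays ASCII
    write(tmp, term);
    System.out.println(read(tmp).equals(term)); // prints true
  }
}
```

The round trip compares strings instead of printing the term, because 
System.out is itself subject to the platform default encoding.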

> FileWriter may garble non-ASCII output if the environment variable 
> LANG/LC_ALL is not appropriate.
> --------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-594
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-594
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.4
>         Environment: RHL Linux 2.6.18
>            Reporter: Shige Takeda
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.5
>
>         Attachments: 
> 0001-set-file-reader-and-writer-character-encoding-to-utf.patch
>
>
> For non-ASCII output data, java.io.FileWriter should be replaced with 
> java.io.OutputStreamWriter in UTF-8.
> For example, if you dump centroids of clusters using ClusterDumper, you may 
> get the following output:
> {noformat}
> ...
> C-0{n=2 c=[brown:2.099, c?t:1.957, dogs:1.916, fox:0.652, jumped:2.099, 
> l?zy:1.884, over:2.099, quick:2.099, red:1.916, ?:0.871, ?:0.871, ?:0.871, 
> ?:0.871] r=[c?t:0.652, fox:0.652, l?zy:1.131, ?:0.871, ?:0.871, ?:0.871, 
> ?:0.871]}
>     Top Terms:
>         quick                                   =>  2.0986123085021973
>         over                                    =>  2.0986123085021973
>         jumped                                  =>  2.0986123085021973
>         brown                                   =>  2.0986123085021973
>         c?t                                     =>   1.957078456878662
>         red                                     =>  1.9162907600402832
>         dogs                                    =>  1.9162907600402832
>         l?zy                                    =>  1.8843144178390503
>         ?                                       =>  0.8706584572792053
>         ?                                       =>  0.8706584572792053
>     Weight:  Point:
>     1.0: P(0) = [brown:2.099, dogs:1.916, fox:2.609, jumped:2.099, 
> over:2.099, quick:2.099, red:1.916, ?:2.322, ?:2.322, ?:2.322, ?:2.322]
>     1.0: P(1) = [brown:2.099, dogs:1.916, fox:2.609, jumped:2.099, 
> over:2.099, quick:2.099, red:1.916, ?:2.322, ?:2.322, ?:2.322, ?:2.322]
>     1.0: P(2) = [brown:2.099, c?t:2.609, dogs:1.916, jumped:2.099, 
> over:2.099, quick:2.099, red:1.916, ?:2.322, ?:2.322, ?:2.322, ?:2.322]
> ...
> {noformat}
> where "?" characters were garbled by FileWriter. NOTE: this test case is a 
> tweaked version of TestClusterDumper. E.g., lazy => läzy
> The cause of this is the line in ClusterDumper.java:
> {code}
> Writer writer = this.outputFile == null ? new OutputStreamWriter(System.out) 
> : new FileWriter(this.outputFile);
> {code}
> This can be worked around by setting the environment variables LC_ALL/LANG to 
> en_US.UTF-8, but many environments default to LC_ALL/LANG=C, and in some 
> cases you may have no choice but C.
> To address this issue, I propose hard-coding the output encoding to UTF-8 
> as follows:
> {code}
> Writer writer = this.outputFile == null ? new OutputStreamWriter(System.out) 
> : new OutputStreamWriter(new FileOutputStream(this.outputFile), UTF8);
> {code}
> This way, the output file encoding no longer depends on the environment.
> If this proposal is accepted, a similar fix should be applied to the 
> following files:
> - ./core/src/main/java/org/apache/mahout/classifier/sgd/ModelSerializer.java
> - ./core/src/test/java/org/apache/mahout/fpm/pfpgrowth/PFPGrowthTest.java
> - ./examples/src/main/java/org/apache/mahout/classifier/sgd/TrainLogistic.java
> - ./examples/src/main/java/org/apache/mahout/clustering/display/DisplaySpectralKMeans.java
> - ./utils/src/main/java/org/apache/mahout/clustering/lda/LDAPrintTopics.java
> - ./utils/src/main/java/org/apache/mahout/utils/SequenceFileDumper.java
> - ./utils/src/main/java/org/apache/mahout/utils/clustering/ClusterDumper.java
> - ./utils/src/main/java/org/apache/mahout/utils/vectors/VectorDumper.java
> - ./utils/src/main/java/org/apache/mahout/utils/vectors/arff/Driver.java
> - ./utils/src/main/java/org/apache/mahout/utils/vectors/lucene/ClusterLabels.java
> - ./utils/src/main/java/org/apache/mahout/utils/vectors/lucene/Driver.java
> Hope not many folks prefer ISO-8859-1 or other 'legacy' character sets.
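
The "?" characters in the dump above can be reproduced without touching the 
environment. The following is a small sketch, assuming Java 7 or later; the 
class name GarbleDemo is hypothetical. It encodes a term with US-ASCII, 
standing in for what a C locale typically gave FileWriter on JVMs of that 
era, and then with UTF-8. String.getBytes(Charset) substitutes the charset's 
replacement byte for unmappable characters, which for US-ASCII is "?".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only; not part of Mahout.
final class GarbleDemo {

  // Encodes text to bytes in the given charset and decodes it back with the same charset.
  static String roundTrip(String text, Charset charset) {
    return new String(text.getBytes(charset), charset);
  }

  public static void main(String[] args) {
    String term = "l\u00e4zy"; // "läzy", escaped so the source file stays ASCII

    // US-ASCII cannot represent U+00E4, so the encoder substitutes '?'.
    System.out.println(roundTrip(term, StandardCharsets.US_ASCII)); // prints l?zy

    // UTF-8 can represent every Unicode character, so the term survives.
    System.out.println(roundTrip(term, StandardCharsets.UTF_8).equals(term)); // prints true

    // The charset that FileWriter would use on this JVM (varies by environment).
    System.out.println(Charset.defaultCharset());
  }
}
```

The last line of main shows the default charset of the running JVM, which is 
the one a plain FileWriter picks up.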

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
