My experience is that it is much better to always use the methods that take an 
explicit charset (InputStreamReader + FileInputStream instead of FileReader, 
the one-arg String.getBytes(charsetName) instead of the no-arg version, etc.).
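
For example, something along these lines (an untested sketch; the file path
in args[0] is only for illustration):

    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;

    public class ExplicitCharset {
        public static void main(String[] args) throws Exception {
            // new FileReader(path) would use the platform default charset;
            // naming the charset explicitly removes that dependency.
            Reader in = new InputStreamReader(new FileInputStream(args[0]), "UTF-8");
            try {
                int c;
                while ((c = in.read()) != -1) {
                    System.out.print((char) c);
                }
            } finally {
                in.close();
            }

            // Likewise for bytes: the one-arg getBytes is explicit, while
            // the no-arg getBytes() depends on file.encoding.
            byte[] bytes = "h\u00e9llo".getBytes("UTF-8");
            System.out.println(bytes.length); // 6 bytes, since é takes two
        }
    }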

-Michael

-----Original Message-----
From: Bruno Abitbol [mailto:[email protected]] 
Sent: Friday, December 18, 2009 5:50 AM
To: [email protected]
Subject: Encoding Hell

Hi,

I have been playing around for two days trying to figure out an issue related 
to the default charset:


   - When I run a trivial job that just displays the default charset on
   Hadoop in pseudo-distributed mode, I obtain US-ASCII, and the java
   property file.encoding reports ANSI_X3.4-1968 (the job is essentially
   the sketch shown after this list).

   - When I run the same job under Eclipse in local mode, I obtain UTF-8
   (which is the one I expect).
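
The job does essentially nothing more than this (a minimal sketch of the
relevant part):

    import java.nio.charset.Charset;

    public class ShowCharset {
        public static void main(String[] args) {
            // Print what the JVM believes the default charset is.
            System.out.println("defaultCharset = " + Charset.defaultCharset());
            System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        }
    }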

I use Gentoo Linux; the locale environment variables are the
following:

LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=en_GB.UTF-8

I have tried to set the file.encoding property to UTF-8 but it doesn't work.
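
For reference, a typical way to pass it to the task JVMs would be something
like this (a sketch using the old-API JobConf; the class name is only for
illustration):

    import org.apache.hadoop.mapred.JobConf;

    public class Utf8TaskJvm {
        public static void configure(JobConf conf) {
            // Ask Hadoop to launch the child (task) JVMs with an explicit
            // file.encoding. mapred.child.java.opts holds the task JVM
            // flags; -Xmx200m restates the usual default heap setting so
            // it is not lost when the property is overridden.
            conf.set("mapred.child.java.opts", "-Xmx200m -Dfile.encoding=UTF-8");
        }
    }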
Any help would be greatly appreciated.

Thank you.



--
Bruno Abitbol
[email protected]
http://www.jobomix.fr