My experience is that it is much better to always use the methods that take the charset explicitly (an InputStreamReader wrapping a FileInputStream instead of FileReader, the one-arg String.getBytes instead of the no-arg version, etc.), so the result never depends on the platform default.
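For instance (the file name and class name here are just placeholders), something along these lines keeps every decode and encode pinned to UTF-8 on Java 6, where the charset-name string overloads are the ones available:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Reader;

    public class ExplicitCharset {
        public static void main(String[] args) throws IOException {
            // Explicit charset: decoding never depends on the platform default,
            // unlike new FileReader("input.txt").
            Reader in = new InputStreamReader(new FileInputStream("input.txt"), "UTF-8");
            try {
                int c;
                while ((c = in.read()) != -1) {
                    System.out.print((char) c);
                }
            } finally {
                in.close();
            }

            // Same idea for String <-> byte[] conversions:
            byte[] bytes = "héllo".getBytes("UTF-8");   // instead of getBytes()
            String back = new String(bytes, "UTF-8");   // instead of new String(bytes)
            System.out.println(back);
        }
    }

Code written this way behaves identically whether file.encoding is UTF-8, US-ASCII, or anything else.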
-Michael

-----Original Message-----
From: Bruno Abitbol [mailto:[email protected]]
Sent: Friday, December 18, 2009 5:50 AM
To: [email protected]
Subject: Encoding Hell

Hi,

I have been playing around for two days trying to figure out an issue related to the default charset:

- When I run a very dummy job that just displays the default charset on Hadoop in pseudo-distributed mode, I get US-ASCII. When I display the Java property file.encoding, I get ANSI_X3.4-1968.
- When I run the same job under Eclipse in local mode, I get UTF-8 (which is the one I expect).

I use a Gentoo Linux distribution; the locale environment variables are the following:

LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=en_GB.UTF-8

I have tried to set the file.encoding property to UTF-8, but it doesn't work.

Any help would be greatly appreciated. Thank you.

--
Bruno Abitbol
[email protected]
http://www.jobomix.fr
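A minimal standalone version of the probe Bruno describes (the class name is just illustrative) might look like this; running it under Eclipse and then inside a map task makes the discrepancy visible:

    import java.nio.charset.Charset;

    public class CharsetProbe {
        public static void main(String[] args) {
            // What the JVM actually uses when no charset is given explicitly.
            System.out.println("defaultCharset = " + Charset.defaultCharset());
            // The system property the JVM derived it from at startup.
            System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        }
    }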
