Re: Shall we change our file.encoding

Nathan Beyer Thu, 16 Jul 2009 20:17:43 -0700

On Thu, Jul 16, 2009 at 9:30 PM, Charles Lee<[email protected]> wrote:
> Thanks Nathan!
>
> I will try this :-)


Where do we define the user's locale and system locale? It seems like
all of this should be located there and associated with that process.
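
For the Unix side of "the default file encoding should derive from the OS", a minimal sketch of one possible approach: initialise the process locale from the environment, then ask the C library for its codeset through the POSIX nl_langinfo(CODESET) call. The helper name os_default_codeset and the "ISO-8859-1" fallback are assumptions made for illustration; this is not the code from the attached patch.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper (not from the Harmony patch): returns the codeset
 * name of the process's LC_CTYPE locale, for example "UTF-8" when the
 * environment selects en_US.UTF-8. The "ISO-8859-1" fallback is an
 * assumption chosen to mirror the current default discussed here. */
static const char *os_default_codeset(void)
{
    /* "" asks the C library to choose the locale from the environment
     * (LC_ALL, LC_CTYPE, LANG). */
    if (setlocale(LC_CTYPE, "") == NULL) {
        return "ISO-8859-1";
    }

    const char *codeset = nl_langinfo(CODESET);
    if (codeset == NULL || codeset[0] == '\0') {
        return "ISO-8859-1";
    }
    return codeset;
}
```

The returned string could then be used as the candidate value for file.encoding, subject to checking that the name is one the Charset APIs recognise.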

>
> On Fri, Jul 17, 2009 at 10:05 AM, Nathan Beyer <[email protected]> wrote:
>
>> On Thu, Jul 16, 2009 at 8:50 PM, Nathan Beyer<[email protected]> wrote:
>> > On Thu, Jul 16, 2009 at 8:35 PM, Nathan Beyer<[email protected]> wrote:
>> >> On Thu, Jul 16, 2009 at 8:26 PM, Nathan Beyer<[email protected]> wrote:
>> >>> On Thu, Jul 16, 2009 at 8:18 PM, Charles Lee<[email protected]> wrote:
>> >>>> Hi Nathan,
>> >>>>
>> >>>> What I got is 936, the code page identifier. Is there an API for us to map
>> >>>> 936 to gb2312?
>> >>>
>> >>> Oh, the 'identifier' bit was missing - yeah, we'll need to translate
>> >>> that into a name of some sort. I'll poke around a bit and see what I
>> >>> can find.
>> >>
>> >> We'll probably just have to put in a mapping ourselves based on the
>> >> documentation. We'd call GetACP [1] and map that to a known alias in
>> >> java.nio.charset that matches the definitions[2] of the identifiers.
>> >>
>> >> [1] http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx
>> >> [2] http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
>> >
>> > This may be better - APR has a function for getting the OS default
>> > encoding. This would work across all platforms that APR supports and I
>> > believe we already use APR.
>> >
>> >
>> > http://apr.apache.org/docs/apr/1.3/group__apr__portabile.html#g6e21845a4a5f3b7dd107b2beea50c91e
>>
>> However, the Windows version of this is simply - return
>> apr_psprintf(pool, "CP%u", (unsigned) GetACP());. Which is essentially
>> "CP" + codePageId.
>>
>> And the Unix version of this method doesn't look very good for our
>> purposes.
>> >
>> > -Nathan
>> >>
>> >>>
>> >>>> If we put 936 in file.encoding, can we successfully get the encoder and
>> >>>> decoder by charset?
>> >>>>
>> >>>> On Fri, Jul 17, 2009 at 9:05 AM, Nathan Beyer <[email protected]> wrote:
>> >>>>
>> >>>>> On Thu, Jul 16, 2009 at 1:28 AM, Charles Lee<[email protected]> wrote:
>> >>>>> > Hi guys,
>> >>>>> >
>> >>>>> > I have added the locale function to DRLVM; the patch is attached.
>> >>>>> > Please try this new patch on Linux.
>> >>>>> >
>> >>>>> > The patch should work on Linux but fail on Windows, because on
>> >>>>> > Windows setlocale returns a code page, not a charset name.
>> >>>>>
>> >>>>> Code page and character set are the same thing. We shouldn't need to
>> >>>>> convert it as the Charset APIs will have to support the values anyway.
>> >>>>>
>> >>>>> What's the value you're getting? If it's 'Cp1252', then we're good, as
>> >>>>> that's just an alias for 'Windows-1252' (or vice-versa).
>> >>>>>
>> >>>>> -Nathan
>> >>>>>
>> >>>>>
>> >>>>> > I have tried for a long time to get the charset name from the
>> >>>>> > code page, for example:
>> >>>>> > CPINFOEX cpInfoEx;
>> >>>>> > /* Query details of the system default ANSI code page. */
>> >>>>> > BOOL iReturn = GetCPInfoEx(CP_ACP, 0, &cpInfoEx);
>> >>>>> > if (iReturn) {
>> >>>>> >     printf("FULL NAME %s\n", cpInfoEx.CodePageName);
>> >>>>> > }
>> >>>>> > But this only gives the full name, not in any fixed format.
>> >>>>> >
>> >>>>> > There is a map of code page identifiers on MSDN, detail here. I could
>> >>>>> > hard-code this map in the file, but the note on MSDN says:
>> >>>>> >      "ANSI code pages can be different on different computers, or can be
>> >>>>> > changed for a single computer, leading to data corruption. For the most
>> >>>>> > consistent results, applications should use Unicode, such as UTF-8 or
>> >>>>> > UTF-16, instead of a specific code page."
>> >>>>> > I am afraid a hard-coded map will fail on some machines. (By the way, this
>> >>>>> > seems to suggest UTF-8 as the default again :-)
>> >>>>> >
>> >>>>> > There is also an Encoding class in VC++, detail here, but we cannot use
>> >>>>> > it here.
>> >>>>> >
>> >>>>> > So does anyone know something about locales on Windows?
>> >>>>> > Again, shall we use UTF-8 as our default?
>> >>>>> >
>> >>>>> > On Wed, Jul 15, 2009 at 2:12 PM, Charles Lee <[email protected]> wrote:
>> >>>>> >>
>> >>>>> >> It seems we should add it to DRLVM.
>> >>>>> >>
>> >>>>> >> On Wed, Jul 15, 2009 at 1:58 PM, Regis <[email protected]> wrote:
>> >>>>> >>>
>> >>>>> >>> Nathan Beyer wrote:
>> >>>>> >>>>
>> >>>>> >>>> Is the IBM VME dealing with this correctly? Do we just need to fix
>> >>>>> >>>> DRLVM?
>> >>>>> >>>
>> >>>>> >>> Yes. I only tested on Linux, where the IBM VME sets the property correctly.
>> >>>>> >>>
>> >>>>> >>>>
>> >>>>> >>>> On Wed, Jul 15, 2009 at 12:25 AM, Regis<[email protected]> wrote:
>> >>>>> >>>>>
>> >>>>> >>>>> Kevin Zhou wrote:
>> >>>>> >>>>>>
>> >>>>> >>>>>> Yes, per luniglob.c, CL attempts to read the "file.encoding" property
>> >>>>> >>>>>> from the VM but fails to get the correct encoding.
>> >>>>> >>>>>>
>> >>>>> >>>>>> Regis, do you know any other specific way that CL can obtain the right
>> >>>>> >>>>>> property?
>> >>>>> >>>>>
>> >>>>> >>>>> We can get it from the OS directly. Maybe just read the environment
>> >>>>> >>>>> variables on Linux?
>> >>>>> >>>>>
>> >>>>> >>>>>> On Wed, Jul 15, 2009 at 9:59 AM, Regis <[email protected]> wrote:
>> >>>>> >>>>>>
>> >>>>> >>>>>>> Charles Lee wrote:
>> >>>>> >>>>>>>
>> >>>>> >>>>>>>> Hi Nathan,
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>> If the file encoding derives from the OS, there must be some bug in it,
>> >>>>> >>>>>>>> because on my Linux machine the locale is en_US.UTF-8 but our default
>> >>>>> >>>>>>>> codec is still ISO8859-1. Do you know where we can find that code?
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>> Classlib expected the VM to do this and set the property, but it didn't,
>> >>>>> >>>>>>> so we have to do this ourselves.
>> >>>>> >>>>>>>
>> >>>>> >>>>>>>
>> >>>>> >>>>>>>
>> >>>>> >>>>>>>> On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer <[email protected]> wrote:
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>>> Are we talking about Windows or Linux? The default file encoding should
>> >>>>> >>>>>>>>> derive from the OS. I believe that's defined by the specs.
>> >>>>> >>>>>>>>>
>> >>>>> >>>>>>>>> Sent from my iPhone
>> >>>>> >>>>>>>>>
>> >>>>> >>>>>>>>>
>> >>>>> >>>>>>>>> On Jul 14, 2009, at 5:51 AM, Charles Lee <[email protected]> wrote:
>> >>>>> >>>>>>>>>
>> >>>>> >>>>>>>>>> On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv <[email protected]> wrote:
>> >>>>> >>>>>>>>>>
>> >>>>> >>>>>>>>>>> Hi,
>> >>>>> >>>>>>>>>>>
>> >>>>> >>>>>>>>>>> Charles, I believe UTF-8 is the default encoding for the RI, and it
>> >>>>> >>>>>>>>>>> sounds reasonable.
>> >>>>> >>>>>>>>>>> BTW, it may cause some compatibility problems, so maybe we need to
>> >>>>> >>>>>>>>>>> run more tests to verify?
>> >>>>> >>>>>>>>>>>
>> >>>>> >>>>>>>>>>> 2009/7/14 Charles Lee <[email protected]>
>> >>>>> >>>>>>>>>>>
>> >>>>> >>>>>>>>>>>
>> >>>>> >>>>>>>>>>>> Hi guys:
>> >>>>> >>>>>>>>>>>>
>> >>>>> >>>>>>>>>>>> I am running some of the ant junit test cases and hitting some
>> >>>>> >>>>>>>>>>>> encoding problems. I think they may be caused by the different default
>> >>>>> >>>>>>>>>>>> encodings of the RI and Harmony. My locale is en_US.UTF-8; the RI
>> >>>>> >>>>>>>>>>>> default is UTF-8 but Harmony's is 8859-1. I then came across
>> >>>>> >>>>>>>>>>>> HARMONY-3736<https://issues.apache.org/jira/browse/HARMONY-3736>
>> >>>>> >>>>>>>>>>>> and the two diffs attached to that issue. It seems we always get
>> >>>>> >>>>>>>>>>>> 8859-1, because: (correct me if wrong :-)
>> >>>>> >>>>>>>>>>>>
>> >>>>> >>>>>>>>>>>> 1. We removed the code that sets it in the VM, so we always get null
>> >>>>> >>>>>>>>>>>> when we call the VM method.
>> >>>>> >>>>>>>>>>>>
>> >>>>> >>>>>>>>>>>> 2. We set file.encoding in libglob.c: if we get null from the VM, we
>> >>>>> >>>>>>>>>>>> set 8859-1.
>> >>>>> >>>>>>>>>>
>> >>>>> >>>>>>>>>> Sorry, it should be luniglob.c
>> >>>>> >>>>>>>>>>
>> >>>>> >>>>>>>>>>>> 3. We cannot set file.encoding at run time.
>> >>>>> >>>>>>>>>>>>
>> >>>>> >>>>>>>>>>>> Ant uses UTF-8 to encode filenames that contain non-ASCII characters.
>> >>>>> >>>>>>>>>>>> So why do we use ISO8859-1 as our unchangeable default?
>> >>>>> >>>>>>>>>>>> The wiki page http://en.wikipedia.org/wiki/ISO8859-1 says "In computing
>> >>>>> >>>>>>>>>>>> applications, encodings that provide full UCS support (such as
>> >>>>> >>>>>>>>>>>> UTF-8<http://en.wikipedia.org/wiki/UTF-8> and
>> >>>>> >>>>>>>>>>>> UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding increasing favor
>> >>>>> >>>>>>>>>>>> over encodings based on ISO 8859-1." Should we simply change ISO8859-1 to
>> >>>>> >>>>>>>>>>>> UTF-8?
>> >>>>> >>>>>>>>>>>>
>> >>>>> >>>>>>>>>>>> --
>> >>>>> >>>>>>>>>>>> Yours sincerely,
>> >>>>> >>>>>>>>>>>> Charles Lee
>> >>>>> >>>>>>>>>>>>
>> >>>>> >>>>>>>>>>>>
>> >>>>> >>>>>>>>>>>>
>> >>>>> >>>>>>>>>>> --
>> >>>>> >>>>>>>>>>>
>> >>>>> >>>>>>>>>>> Best Regards!
>> >>>>> >>>>>>>>>>>
>> >>>>> >>>>>>>>>>> Jimmy, Jing Lv
>> >>>>> >>>>>>>>>>> China Software Development Lab, IBM
>> >>>>> >>>>>>>>>>>
>> >>>>> >>>>>>>>>>>
>> >>>>> >>>>>>>>>>>
>> >>>>> >>>>>>>>>> --
>> >>>>> >>>>>>>>>> Yours sincerely,
>> >>>>> >>>>>>>>>> Charles Lee
>> >>>>> >>>>>>>>>>
>> >>>>> >>>>>>>>>>
>> >>>>> >>>>>>> --
>> >>>>> >>>>>>> Best Regards,
>> >>>>> >>>>>>> Regis.
>> >>>>> >>>>>>>
>> >>>>> >>>>>
>> >>>>> >>>>> --
>> >>>>> >>>>> Best Regards,
>> >>>>> >>>>> Regis.
>> >>>>> >>>>>
>> >>>>> >>>>
>> >>>>> >>>
>> >>>>> >>>
>> >>>>> >>> --
>> >>>>> >>> Best Regards,
>> >>>>> >>> Regis.
>> >>>>> >>
>> >>>>> >>
>> >>>>> >>
>> >>>>> >> --
>> >>>>> >> Yours sincerely,
>> >>>>> >> Charles Lee
>> >>>>> >>
>> >>>>> >
>> >>>>> >
>> >>>>> >
>> >>>>> > --
>> >>>>> > Yours sincerely,
>> >>>>> > Charles Lee
>> >>>>> >
>> >>>>> >
>> >>>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Yours sincerely,
>> >>>> Charles Lee
>> >>>>
>> >>>
>> >>
>> >
>>
>
>
>
> --
> Yours sincerely,
> Charles Lee
>
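
The mapping idea raised in the thread (call GetACP, then translate the numeric identifier into a charset name) could look roughly like the sketch below. The table is only a small illustrative subset, using names as listed in the MSDN code page identifiers table. The names cp_entry, CP_TABLE and charset_for_code_page are hypothetical, and whether each name is a recognised alias in java.nio.charset would still need to be checked.

```c
#include <stddef.h>

/* One row of the hypothetical mapping: Windows code page identifier
 * and the charset name listed for it in the MSDN identifier table. */
struct cp_entry {
    unsigned    id;
    const char *name;
};

/* Illustrative subset only; a real table would need many more rows. */
static const struct cp_entry CP_TABLE[] = {
    {   932, "shift_jis"    },
    {   936, "gb2312"       },
    {   950, "big5"         },
    {  1252, "windows-1252" },
    { 28591, "iso-8859-1"   },
    { 65001, "utf-8"        },
};

/* Returns the charset name for a code page identifier, or NULL when the
 * identifier is not in the table, so the caller can choose a fallback
 * such as the "CP" + id form mentioned above. */
static const char *charset_for_code_page(unsigned code_page)
{
    size_t count = sizeof(CP_TABLE) / sizeof(CP_TABLE[0]);
    for (size_t i = 0; i < count; i++) {
        if (CP_TABLE[i].id == code_page) {
            return CP_TABLE[i].name;
        }
    }
    return NULL;
}
```

On Windows the argument would come from GetACP(). Returning NULL for unknown identifiers keeps the fallback policy in the caller rather than in the table.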
