Re: Shall we change our file.encoding

Charles Lee Fri, 17 Jul 2009 06:43:42 -0700

On Fri, Jul 17, 2009 at 11:17 AM, Nathan Beyer <[email protected]> wrote:


> On Thu, Jul 16, 2009 at 9:30 PM, Charles Lee<[email protected]> wrote:
> > Thanks Nathan!
> >
> > I will try this :-)
>
> Where do we define the user's locale and system locale? It seems like
> all of this should be located there and associated with that process.
>

Sorry Nathan, I didn't catch that. Do you mean we should get the user's locale
or the system locale?


> > On Fri, Jul 17, 2009 at 10:05 AM, Nathan Beyer <[email protected]>
> wrote:
> >
> >> On Thu, Jul 16, 2009 at 8:50 PM, Nathan Beyer<[email protected]>
> wrote:
> >> > On Thu, Jul 16, 2009 at 8:35 PM, Nathan Beyer<[email protected]>
> wrote:
> >> >> On Thu, Jul 16, 2009 at 8:26 PM, Nathan Beyer<[email protected]>
> >> wrote:
> >> >>> On Thu, Jul 16, 2009 at 8:18 PM, Charles Lee<[email protected]>
> >> wrote:
> >> >>>> Hi Nathan,
> >> >>>>
> >> >>>> What I got is 936, the code page identifier. Is there an API for us
> >> >>>> to map 936 to gb2312?
> >> >>>
> >> >>> Oh, the 'identifier' bit was missing - yeah, we'll need to translate
> >> >>> that into a name of some sort. I'll poke around a bit and see what I
> >> >>> can find.
> >> >>
> >> >> We'll probably just have to put in a mapping ourselves based on the
> >> >> documentation. We'd call GetACP [1] and map that to a known alias in
> >> >> java.nio.charset that matches the definitions[2] of the identifiers.
> >> >>
> >> >> [1] http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx
> >> >> [2] http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
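
As a rough illustration of the mapping idea above, here is a minimal C sketch.
The table is deliberately partial and the name charset_name_for_code_page is
hypothetical; the entries would need checking against the identifier list in
[2] and against the aliases the Charset providers actually register. Unknown
identifiers fall back to the form "CP<id>". On Windows the input would
presumably come from GetACP(), which is not called here so that the sketch
stays portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Partial, illustrative table: Windows code page identifier -> charset name.
 * Entries are assumptions to be verified against the MSDN identifier list. */
struct cp_entry {
    unsigned id;
    const char *name;
};

static const struct cp_entry CP_TABLE[] = {
    { 936,   "gb2312" },
    { 1252,  "windows-1252" },
    { 28591, "iso-8859-1" },
    { 65001, "utf-8" },
};

/* Hypothetical helper: writes a charset name for code_page into out.
 * Falls back to "CP<id>" when the identifier is not in the table.
 * Returns 0 on success, -1 if out is too small. */
int charset_name_for_code_page(unsigned code_page, char *out, size_t out_size)
{
    size_t i;
    int written;

    assert(out != NULL && out_size > 0);

    for (i = 0; i < sizeof(CP_TABLE) / sizeof(CP_TABLE[0]); i++) {
        if (CP_TABLE[i].id == code_page) {
            size_t len = strlen(CP_TABLE[i].name);
            if (len >= out_size) {
                return -1;
            }
            memcpy(out, CP_TABLE[i].name, len + 1); /* include the NUL */
            return 0;
        }
    }

    /* Unknown identifier: produce "CP" followed by the decimal id. */
    written = snprintf(out, out_size, "CP%u", code_page);
    return (written > 0 && (size_t)written < out_size) ? 0 : -1;
}
```

The fallback keeps the function total: a caller always gets some name back and
can then ask the Charset layer whether that name is supported.
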
> >> >
> >> > This may be better - APR has a function for getting the OS default
> >> > encoding. This would work across all platforms that APR supports and I
> >> > believe we already use APR.
> >> >
> >> >
> >>
> http://apr.apache.org/docs/apr/1.3/group__apr__portabile.html#g6e21845a4a5f3b7dd107b2beea50c91e
> >>
> >> However, the Windows version of this is simply - return
> >> apr_psprintf(pool, "CP%u", (unsigned) GetACP());. Which is essentially
> >> "CP" + codePageId.
> >>
> >> And the Unix version of this method doesn't look very good for our
> >> purposes.
> >> >
> >> > -Nathan
> >> >>
> >> >>>
> >> >>>> If we put 936 in file.encoding, can we successfully get the encoder
> >> >>>> and decoder by charset?
> >> >>>>
> >> >>>> On Fri, Jul 17, 2009 at 9:05 AM, Nathan Beyer <[email protected]>
> >> wrote:
> >> >>>>
> >> >>>>> On Thu, Jul 16, 2009 at 1:28 AM, Charles Lee<
> [email protected]>
> >> wrote:
> >> >>>>> > Hi guys,
> >> >>>>> >
> >> >>>>> > I have added the locale function to DRLVM; the patch is attached.
> >> >>>>> > Please try this new patch on Linux.
> >> >>>>> >
> >> >>>>> > The patch should work on Linux but fail on Windows, because on
> >> >>>>> > Windows setlocale returns a code page, not a charset name.
> >> >>>>>
> >> >>>>> Code page and character set are the same thing. We shouldn't need
> to
> >> >>>>> convert it as the Charset APIs will have to support the values
> >> anyway.
> >> >>>>>
> >> >>>>> What's the value you're getting? If it's 'Cp1252', then we're
> good,
> >> as
> >> >>>>> that's just an alias for 'Windows-1252' (or vice-versa).
> >> >>>>>
> >> >>>>> -Nathan
> >> >>>>>
> >> >>>>>
> >> >>>>> > I have spent a long time trying to get the charset name from the
> >> >>>>> > code page, for example:
> >> >>>>> > CPINFOEX cpInfoEx;
> >> >>>>> > BOOL iReturn = GetCPInfoEx(CP_ACP, 0, &cpInfoEx);
> >> >>>>> > if (iReturn) {
> >> >>>>> >     /* CodePageName is a TCHAR array; %s assumes an ANSI build */
> >> >>>>> >     printf("FULL NAME %s\n", cpInfoEx.CodePageName);
> >> >>>>> > }
> >> >>>>> > But I only get the full name, not in any standard format.
> >> >>>>> >
> >> >>>>> > There is a map of code page identifiers on MSDN, details here. I
> >> >>>>> > could hard-code this map in the file, but the note on MSDN says:
> >> >>>>> >      "ANSI code pages can be different on different computers, or
> >> >>>>> > can be changed for a single computer, leading to data corruption.
> >> >>>>> > For the most consistent results, applications should use Unicode,
> >> >>>>> > such as UTF-8 or UTF-16, instead of a specific code page."
> >> >>>>> > I am afraid a hard-coded map will fail on some machines. (By the
> >> >>>>> > way, this seems to suggest UTF-8 as the default again :-)
> >> >>>>> >
> >> >>>>> > There is also an Encoding class in VC++, details here, but we
> >> >>>>> > cannot use it here.
> >> >>>>> >
> >> >>>>> > So does anyone know something about locales on Windows?
> >> >>>>> > Again, shall we use UTF-8 as our default?
> >> >>>>> >
> >> >>>>> > On Wed, Jul 15, 2009 at 2:12 PM, Charles Lee <
> >> [email protected]>
> >> >>>>> wrote:
> >> >>>>> >>
> >> >>>>> >> It seems we should add it in DRLVM.
> >> >>>>> >>
> >> >>>>> >> On Wed, Jul 15, 2009 at 1:58 PM, Regis <[email protected]>
> >> wrote:
> >> >>>>> >>>
> >> >>>>> >>> Nathan Beyer wrote:
> >> >>>>> >>>>
> >> >>>>> >>>> Is the IBM VME dealing with this correctly? Do we just need to
> >> >>>>> >>>> fix DRLVM?
> >> >>>>> >>>
> >> >>>>> >>> Yes. I only tested on Linux; the IBM VME sets the property
> >> >>>>> >>> correctly.
> >> >>>>> >>>
> >> >>>>> >>>>
> >> >>>>> >>>> On Wed, Jul 15, 2009 at 12:25 AM, Regis<[email protected]>
> >> wrote:
> >> >>>>> >>>>>
> >> >>>>> >>>>> Kevin Zhou wrote:
> >> >>>>> >>>>>>
> >> >>>>> >>>>>> Yes, from luniglob.c, CL attempts to read the "file.encoding"
> >> >>>>> >>>>>> property from the VM but fails to get the correct encoding.
> >> >>>>> >>>>>>
> >> >>>>> >>>>>> Regis, do you know any other way that CL can obtain the right
> >> >>>>> >>>>>> property?
> >> >>>>> >>>>>
> >> >>>>> >>>>> We can get it from the OS directly. Maybe just read the
> >> >>>>> >>>>> environment variables on Linux?
> >> >>>>> >>>>>
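
For the Linux side, one way to read this from the environment is sketched
below in C. It assumes the usual POSIX locale name shape
language[_territory][.codeset][@modifier] and the usual precedence LC_ALL,
then LC_CTYPE, then LANG; both function names are hypothetical. Calling
setlocale(LC_CTYPE, "") followed by nl_langinfo(CODESET) is probably the more
robust route, since a locale such as "C" carries no codeset in its name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns the first non-empty value among
 * LC_ALL, LC_CTYPE and LANG, or NULL if none is set. */
const char *effective_ctype_locale(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    size_t i;

    for (i = 0; i < sizeof(vars) / sizeof(vars[0]); i++) {
        const char *value = getenv(vars[i]);
        if (value != NULL && value[0] != '\0') {
            return value;
        }
    }
    return NULL;
}

/* Hypothetical helper: copies the codeset part of a locale name
 * (the text between '.' and an optional '@') into out.
 * Returns 0 on success, -1 if there is no codeset or out is too small. */
int codeset_from_locale(const char *locale, char *out, size_t out_size)
{
    const char *start;
    const char *end;
    size_t len;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';

    if (locale == NULL) {
        return -1;
    }
    start = strchr(locale, '.');
    if (start == NULL) {
        return -1; /* e.g. "C" or "en_US": no codeset in the name */
    }
    start++; /* skip the dot */

    end = strchr(start, '@');
    len = (end != NULL) ? (size_t)(end - start) : strlen(start);
    if (len == 0 || len >= out_size) {
        return -1;
    }
    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}
```

With LANG=en_US.UTF-8 and the other two variables unset,
codeset_from_locale(effective_ctype_locale(), buf, sizeof buf) would leave
"UTF-8" in buf.
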
> >> >>>>> >>>>>> Wed, Jul 15, 2009 at 9:59 AM, Regis <[email protected]>
> >> wrote:
> >> >>>>> >>>>>>
> >> >>>>> >>>>>>> Charles Lee wrote:
> >> >>>>> >>>>>>>
> >> >>>>> >>>>>>>> Hi Nathan,
> >> >>>>> >>>>>>>>
> >> >>>>> >>>>>>>> If the file encoding derives from the OS, there must be a bug
> >> >>>>> >>>>>>>> somewhere, because on my Linux machine the locale is
> >> >>>>> >>>>>>>> en_US.UTF-8 but our default codec is still ISO8859-1. Do you
> >> >>>>> >>>>>>>> know where we can find that code?
> >> >>>>> >>>>>>>>
> >> >>>>> >>>>>>> Classlib expects the VM to do this and set the property, but
> >> >>>>> >>>>>>> it doesn't, so we have to do it ourselves.
> >> >>>>> >>>>>>>
> >> >>>>> >>>>>>>
> >> >>>>> >>>>>>>
> >> >>>>> >>>>>>>> On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer <
> >> [email protected]>
> >> >>>>> >>>>>>>> wrote:
> >> >>>>> >>>>>>>>
> >> >>>>> >>>>>>>>> Are we talking about Windows or Linux? The default file
> >> >>>>> >>>>>>>>> encoding should derive from the OS. I believe that's defined
> >> >>>>> >>>>>>>>> by the specs.
> >> >>>>> >>>>>>>>>
> >> >>>>> >>>>>>>>> Sent from my iPhone
> >> >>>>> >>>>>>>>>
> >> >>>>> >>>>>>>>>
> >> >>>>> >>>>>>>>> On Jul 14, 2009, at 5:51 AM, Charles Lee <
> >> [email protected]>
> >> >>>>> >>>>>>>>> wrote:
> >> >>>>> >>>>>>>>>
> >> >>>>> >>>>>>>>>  On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv
> >> >>>>> >>>>>>>>> <[email protected]>
> >> >>>>> >>>>>>>>>
> >> >>>>> >>>>>>>>>> wrote:
> >> >>>>> >>>>>>>>>>
> >> >>>>> >>>>>>>>>>> Hi,
> >> >>>>> >>>>>>>>>>>
> >> >>>>> >>>>>>>>>>> Charles, I believe UTF-8 is the default encoding for the RI,
> >> >>>>> >>>>>>>>>>> and it sounds reasonable.
> >> >>>>> >>>>>>>>>>> BTW, we may run into some compatibility problems; maybe we
> >> >>>>> >>>>>>>>>>> need to run more tests to verify?
> >> >>>>> >>>>>>>>>>>
> >> >>>>> >>>>>>>>>>> 2009/7/14 Charles Lee <[email protected]>
> >> >>>>> >>>>>>>>>>>
> >> >>>>> >>>>>>>>>>>> Hi guys:
> >> >>>>> >>>>>>>>>>>>
> >> >>>>> >>>>>>>>>>>> I am running the ant junit test cases and hitting some
> >> >>>>> >>>>>>>>>>>> encoding problems. I think they may be caused by the
> >> >>>>> >>>>>>>>>>>> different default encodings of the RI and Harmony. My locale
> >> >>>>> >>>>>>>>>>>> is en_US.UTF-8; the RI default is UTF-8 but Harmony's is
> >> >>>>> >>>>>>>>>>>> 8859-1. I then came across
> >> >>>>> >>>>>>>>>>>> HARMONY-3736<https://issues.apache.org/jira/browse/HARMONY-3736>
> >> >>>>> >>>>>>>>>>>> and the two diffs attached to that issue. It seems we
> >> >>>>> >>>>>>>>>>>> always get 8859-1, because (correct me if I am wrong :-)
> >> >>>>> >>>>>>>>>>>>
> >> >>>>> >>>>>>>>>>>> 1. We removed the code that sets it in the VM, so we will
> >> >>>>> >>>>>>>>>>>> always get null if we call the VM method.
> >> >>>>> >>>>>>>>>>>>
> >> >>>>> >>>>>>>>>>>> 2. We set file.encoding in libglob.c: if we get null from
> >> >>>>> >>>>>>>>>>>> the VM, we set 8859-1.
> >> >>>>> >>>>>>>>>>
> >> >>>>> >>>>>>>>>> Sorry, it should be luniglob.c
> >> >>>>> >>>>>>>>>>
> >> >>>>> >>>>>>>>>>>> 3. We cannot set file.encoding at run time.
> >> >>>>> >>>>>>>>>>>>
> >> >>>>> >>>>>>>>>>>> ant uses UTF-8 to encode file names which contain non-ASCII
> >> >>>>> >>>>>>>>>>>> characters. So why do we use iso8859-1 as our unchangeable
> >> >>>>> >>>>>>>>>>>> default? The wiki page
> >> >>>>> >>>>>>>>>>>> http://en.wikipedia.org/wiki/ISO8859-1 says "In computing
> >> >>>>> >>>>>>>>>>>> applications, encodings that provide full UCS support (such
> >> >>>>> >>>>>>>>>>>> as UTF-8 <http://en.wikipedia.org/wiki/UTF-8> and
> >> >>>>> >>>>>>>>>>>> UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding
> >> >>>>> >>>>>>>>>>>> increasing favor over encodings based on ISO 8859-1."
> >> >>>>> >>>>>>>>>>>> Should we simply change iso8859-1 to utf-8?
> >> >>>>> >>>>>>>>>>>>
> >> >>>>> >>>>>>>>>>>> --
> >> >>>>> >>>>>>>>>>>> Yours sincerely,
> >> >>>>> >>>>>>>>>>>> Charles Lee
> >> >>>>> >>>>>>>>>>>>
> >> >>>>> >>>>>>>>>>>>
> >> >>>>> >>>>>>>>>>>>
> >> >>>>> >>>>>>>>>>> --
> >> >>>>> >>>>>>>>>>>
> >> >>>>> >>>>>>>>>>> Best Regards!
> >> >>>>> >>>>>>>>>>>
> >> >>>>> >>>>>>>>>>> Jimmy, Jing Lv
> >> >>>>> >>>>>>>>>>> China Software Development Lab, IBM
> >> >>>>> >>>>>>>>>>>
> >> >>>>> >>>>>>>>>>>
> >> >>>>> >>>>>>>>>>>
> >> >>>>> >>>>>>>>>> --
> >> >>>>> >>>>>>>>>> Yours sincerely,
> >> >>>>> >>>>>>>>>> Charles Lee
> >> >>>>> >>>>>>>>>>
> >> >>>>> >>>>>>>>>>
> >> >>>>> >>>>>>> --
> >> >>>>> >>>>>>> Best Regards,
> >> >>>>> >>>>>>> Regis.
> >> >>>>> >>>>>>>
> >> >>>>> >>>>>
> >> >>>>> >>>>> --
> >> >>>>> >>>>> Best Regards,
> >> >>>>> >>>>> Regis.
> >> >>>>> >>>>>
> >> >>>>> >>>>
> >> >>>>> >>>
> >> >>>>> >>>
> >> >>>>> >>> --
> >> >>>>> >>> Best Regards,
> >> >>>>> >>> Regis.
> >> >>>>> >>
> >> >>>>> >>
> >> >>>>> >>
> >> >>>>> >> --
> >> >>>>> >> Yours sincerely,
> >> >>>>> >> Charles Lee
> >> >>>>> >>
> >> >>>>> >
> >> >>>>> >
> >> >>>>> >
> >> >>>>> > --
> >> >>>>> > Yours sincerely,
> >> >>>>> > Charles Lee
> >> >>>>> >
> >> >>>>> >
> >> >>>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> --
> >> >>>> Yours sincerely,
> >> >>>> Charles Lee
> >> >>>>
> >> >>>
> >> >>
> >> >
> >>
> >
> >
> >
> > --
> > Yours sincerely,
> > Charles Lee
> >
>



-- 
Yours sincerely,
Charles Lee
