Re: Shall we change our file.encoding

Charles Lee Thu, 16 Jul 2009 19:31:31 -0700

Thanks Nathan!

I will try this :-)
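
A minimal sketch of the mapping idea from the quoted messages below: take the numeric code page identifier (what GetACP returns on Windows) and translate it to a charset name, falling back to the "CP" + id form that APR produces. The table contents, the struct, and the function name charset_for_code_page are assumptions made for illustration; they are not taken from the actual patch, and the table is far from complete. Note that the MSDN table names code page 936 "gb2312", while GBK (a superset of GB2312) is the name commonly used for it on the Java side.

```c
#include <stdio.h>

/* Hypothetical lookup table: a few Windows code page identifiers and the
 * charset names they are commonly associated with. Not exhaustive. */
struct cp_entry {
    unsigned int id;
    const char *name;
};

static const struct cp_entry cp_table[] = {
    { 936,   "GBK" },          /* listed as gb2312 in the MSDN table */
    { 1252,  "windows-1252" },
    { 28591, "ISO-8859-1" },
    { 65001, "UTF-8" },
};

/* Writes a charset name for code_page into buf and returns buf.
 * Known identifiers come from the table; unknown ones fall back to
 * "CP<id>", the same form APR builds on Windows. */
static const char *charset_for_code_page(unsigned int code_page,
                                         char *buf, size_t buf_len)
{
    size_t i;

    for (i = 0; i < sizeof(cp_table) / sizeof(cp_table[0]); i++) {
        if (cp_table[i].id == code_page) {
            snprintf(buf, buf_len, "%s", cp_table[i].name);
            return buf;
        }
    }
    snprintf(buf, buf_len, "CP%u", code_page);
    return buf;
}
```

On Windows the argument would be the result of GetACP(); the returned string could then be used as the file.encoding value, provided the charset provider recognizes that name or alias.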


On Fri, Jul 17, 2009 at 10:05 AM, Nathan Beyer <ndbe...@apache.org> wrote:

> On Thu, Jul 16, 2009 at 8:50 PM, Nathan Beyer<ndbe...@apache.org> wrote:
> > On Thu, Jul 16, 2009 at 8:35 PM, Nathan Beyer<ndbe...@apache.org> wrote:
> >> On Thu, Jul 16, 2009 at 8:26 PM, Nathan Beyer<ndbe...@apache.org> wrote:
> >>> On Thu, Jul 16, 2009 at 8:18 PM, Charles Lee<littlee1...@gmail.com> wrote:
> >>>> Hi Nathan,
> >>>>
> >>>> What I got is 936, the code page identifier. Is there an API for us
> >>>> to map 936 to gb2312?
> >>>
> >>> Oh, the 'identifier' bit was missing - yeah, we'll need to translate
> >>> that into a name of some sort. I'll poke around a bit and see what I
> >>> can find.
> >>
> >> We'll probably just have to put in a mapping ourselves based on the
> >> documentation. We'd call GetACP [1] and map that to a known alias in
> >> java.nio.charset that matches the definitions[2] of the identifiers.
> >>
> >> [1] http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx
> >> [2] http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
> >
> > This may be better - APR has a function for getting the OS default
> > encoding. This would work across all platforms that APR supports and I
> > believe we already use APR.
> >
> >
> > http://apr.apache.org/docs/apr/1.3/group__apr__portabile.html#g6e21845a4a5f3b7dd107b2beea50c91e
>
> However, the Windows version of this is simply
> return apr_psprintf(pool, "CP%u", (unsigned) GetACP());
> which is essentially "CP" + codePageId.
>
> And the Unix version of this function doesn't look very good for our
> purposes.
> >
> > -Nathan
> >>
> >>>
> >>>> If we put 936 in file.encoding, can we successfully get the encoder
> >>>> and decoder by charset?
> >>>>
> >>>> On Fri, Jul 17, 2009 at 9:05 AM, Nathan Beyer <ndbe...@apache.org> wrote:
> >>>>
> >>>>> On Thu, Jul 16, 2009 at 1:28 AM, Charles Lee<littlee1...@gmail.com> wrote:
> >>>>> > Hi guys,
> >>>>> >
> >>>>> > I have added the locale function to DRLVM; the patch is attached.
> >>>>> > Please try this new patch on Linux.
> >>>>> >
> >>>>> > The patch should work on Linux but fail on Windows, because on
> >>>>> > Windows setlocale returns a code page, not a charset name.
> >>>>>
> >>>>> Code page and character set are the same thing. We shouldn't need to
> >>>>> convert it, as the Charset APIs will have to support the values anyway.
> >>>>>
> >>>>> What's the value you're getting? If it's 'Cp1252', then we're good,
> >>>>> as that's just an alias for 'Windows-1252' (or vice-versa).
> >>>>>
> >>>>> -Nathan
> >>>>>
> >>>>>
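> >>>>> > I have spent a long time trying to get the charset name from the
> >>>>> > code page, for example:
> >>>>> > /* needs <windows.h> and <stdio.h>; assumes a non-UNICODE build */
> >>>>> > CPINFOEX cpInfoEx;
> >>>>> > /* CP_ACP selects the system default ANSI code page */
> >>>>> > BOOL iReturn = GetCPInfoEx(CP_ACP, 0, &cpInfoEx);
> >>>>> > if (iReturn) {
> >>>>> >     printf("FULL NAME %s\n", cpInfoEx.CodePageName);
> >>>>> > }
> >>>>> > But I only get the full descriptive name, not a name in any
> >>>>> > standard format.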
> >>>>> >
> >>>>> > There is a map of code page identifiers on MSDN, detail here. I
> >>>>> > could hard-code this map in the file. But the note on MSDN says:
> >>>>> >      "ANSI code pages can be different on different computers, or
> >>>>> > can be changed for a single computer, leading to data corruption.
> >>>>> > For the most consistent results, applications should use Unicode,
> >>>>> > such as UTF-8 or UTF-16, instead of a specific code page."
> >>>>> > I am afraid a hard-coded map will fail on some machines. (By the
> >>>>> > way, this again seems to suggest UTF-8 as the default :-)
> >>>>> >
> >>>>> > There is also an Encoding class in VC++, detail here. But we
> >>>>> > cannot use it here.
> >>>>> >
> >>>>> > So does anyone know something about locales on Windows?
> >>>>> > Again, shall we use UTF-8 as our default?
> >>>>> >
> >>>>> > On Wed, Jul 15, 2009 at 2:12 PM, Charles Lee <littlee1...@gmail.com> wrote:
> >>>>> >>
> >>>>> >> It seems we should add it to DRLVM, then.
> >>>>> >>
> >>>>> >> On Wed, Jul 15, 2009 at 1:58 PM, Regis <xu.re...@gmail.com> wrote:
> >>>>> >>>
> >>>>> >>> Nathan Beyer wrote:
> >>>>> >>>>
> >>>>> >>>> Is the IBM VME dealing with this correctly? Do we just need to
> >>>>> >>>> fix DRLVM?
> >>>>> >>>
> >>>>> >>> Yes. I only tested on Linux, where the IBM VME sets the property
> >>>>> >>> correctly.
> >>>>> >>>
> >>>>> >>>>
> >>>>> >>>> On Wed, Jul 15, 2009 at 12:25 AM, Regis<xu.re...@gmail.com> wrote:
> >>>>> >>>>>
> >>>>> >>>>> Kevin Zhou wrote:
> >>>>> >>>>>>
> >>>>> >>>>>> Yes, in luniglob.c, CL attempts to read the "file.encoding"
> >>>>> >>>>>> property from the VM but fails to get the correct encoding.
> >>>>> >>>>>>
> >>>>> >>>>>> Regis, do you know of any other way that CL can obtain the
> >>>>> >>>>>> right property?
> >>>>> >>>>>
> >>>>> >>>>> We can get it from the OS directly. Maybe just read the
> >>>>> >>>>> environment variables on Linux?
> >>>>> >>>>>
> >>>>> >>>>>> On Wed, Jul 15, 2009 at 9:59 AM, Regis <xu.re...@gmail.com> wrote:
> >>>>> >>>>>>
> >>>>> >>>>>>> Charles Lee wrote:
> >>>>> >>>>>>>
> >>>>> >>>>>>>> Hi Nathan,
> >>>>> >>>>>>>>
> >>>>> >>>>>>>> If the file encoding derives from the OS, there must be some
> >>>>> >>>>>>>> bug in it, because on my Linux machine the locale is
> >>>>> >>>>>>>> en_US.UTF-8 but our default codec is still ISO8859-1. Do you
> >>>>> >>>>>>>> know where we can find the code that does this?
> >>>>> >>>>>>>>
> >>>>> >>>>>>> Classlib expects the VM to do this and set the property, but it
> >>>>> >>>>>>> doesn't, so we have to do it ourselves.
> >>>>> >>>>>>>
> >>>>> >>>>>>>
> >>>>> >>>>>>>
> >>>>> >>>>>>>> On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer <nbe...@gmail.com> wrote:
> >>>>> >>>>>>>>
> >>>>> >>>>>>>>> Are we talking about Windows or Linux? The default file
> >>>>> >>>>>>>>> encoding should derive from the OS. I believe that's defined
> >>>>> >>>>>>>>> by the specs.
> >>>>> >>>>>>>>>
> >>>>> >>>>>>>>> Sent from my iPhone
> >>>>> >>>>>>>>>
> >>>>> >>>>>>>>>
> >>>>> >>>>>>>>> On Jul 14, 2009, at 5:51 AM, Charles Lee <littlee1...@gmail.com> wrote:
> >>>>> >>>>>>>>>
> >>>>> >>>>>>>>>> On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv <firep...@gmail.com> wrote:
> >>>>> >>>>>>>>>>
> >>>>> >>>>>>>>>>> Hi,
> >>>>> >>>>>>>>>>>
> >>>>> >>>>>>>>>>> Charles, I believe UTF-8 is the default encoding for the RI,
> >>>>> >>>>>>>>>>> and it sounds reasonable.
> >>>>> >>>>>>>>>>> BTW, this may cause some compatibility problems; maybe we
> >>>>> >>>>>>>>>>> need to run more tests to verify?
> >>>>> >>>>>>>>>>>
> >>>>> >>>>>>>>>>> 2009/7/14 Charles Lee <littlee1...@gmail.com>
> >>>>> >>>>>>>>>>>
> >>>>> >>>>>>>>>>>> Hi guys:
> >>>>> >>>>>>>>>>>>
> >>>>> >>>>>>>>>>>> I am running the ant junit test cases and meeting some
> >>>>> >>>>>>>>>>>> encoding problems. I find they may be caused by the
> >>>>> >>>>>>>>>>>> different default encodings of the RI and Harmony. My
> >>>>> >>>>>>>>>>>> locale is en_US.UTF-8; the RI default is UTF-8 but
> >>>>> >>>>>>>>>>>> Harmony's is 8859-1. I then came across
> >>>>> >>>>>>>>>>>> HARMONY-3736 <https://issues.apache.org/jira/browse/HARMONY-3736>
> >>>>> >>>>>>>>>>>> and the two diffs attached to that issue. It seems we
> >>>>> >>>>>>>>>>>> always get 8859-1, because (correct me if wrong :-)
> >>>>> >>>>>>>>>>>>
> >>>>> >>>>>>>>>>>> 1. We removed the set code in the VM, so we always get null
> >>>>> >>>>>>>>>>>>    when we call the VM method.
> >>>>> >>>>>>>>>>>>
> >>>>> >>>>>>>>>>>> 2. We set file.encoding in libglob.c: if we get null from
> >>>>> >>>>>>>>>>>>    the VM, we set 8859-1.
> >>>>> >>>>>>>>>>
> >>>>> >>>>>>>>>> Sorry, it should be luniglob.c
> >>>>> >>>>>>>>>>
> >>>>> >>>>>>>>>>>> 3. We cannot set file.encoding at run time.
> >>>>> >>>>>>>>>>>>
> >>>>> >>>>>>>>>>>> Ant uses UTF-8 to encode filenames that contain non-ASCII
> >>>>> >>>>>>>>>>>> characters. So why do we use iso8859-1 as our unchangeable
> >>>>> >>>>>>>>>>>> default?
> >>>>> >>>>>>>>>>>> The wiki at http://en.wikipedia.org/wiki/ISO8859-1 says: "In
> >>>>> >>>>>>>>>>>> computing applications, encodings that provide full UCS
> >>>>> >>>>>>>>>>>> support (such as UTF-8 <http://en.wikipedia.org/wiki/UTF-8>
> >>>>> >>>>>>>>>>>> and UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding
> >>>>> >>>>>>>>>>>> increasing favor over encodings based on ISO 8859-1." Should
> >>>>> >>>>>>>>>>>> we simply change iso8859-1 to utf-8?
> >>>>> >>>>>>>>>>>>
> >>>>> >>>>>>>>>>>> --
> >>>>> >>>>>>>>>>>> Yours sincerely,
> >>>>> >>>>>>>>>>>> Charles Lee
> >>>>> >>>>>>>>>>>>
> >>>>> >>>>>>>>>>>>
> >>>>> >>>>>>>>>>>>
> >>>>> >>>>>>>>>>> --
> >>>>> >>>>>>>>>>>
> >>>>> >>>>>>>>>>> Best Regards!
> >>>>> >>>>>>>>>>>
> >>>>> >>>>>>>>>>> Jimmy, Jing Lv
> >>>>> >>>>>>>>>>> China Software Development Lab, IBM
> >>>>> >>>>>>>>>>>
> >>>>> >>>>>>>>>>>
> >>>>> >>>>>>>>>>>
> >>>>> >>>>>>>>>> --
> >>>>> >>>>>>>>>> Yours sincerely,
> >>>>> >>>>>>>>>> Charles Lee
> >>>>> >>>>>>>>>>
> >>>>> >>>>>>>>>>
> >>>>> >>>>>>> --
> >>>>> >>>>>>> Best Regards,
> >>>>> >>>>>>> Regis.
> >>>>> >>>>>>>
> >>>>> >>>>>
> >>>>> >>>>> --
> >>>>> >>>>> Best Regards,
> >>>>> >>>>> Regis.
> >>>>> >>>>>
> >>>>> >>>>
> >>>>> >>>
> >>>>> >>>
> >>>>> >>> --
> >>>>> >>> Best Regards,
> >>>>> >>> Regis.
> >>>>> >>
> >>>>> >>
> >>>>> >>
> >>>>> >> --
> >>>>> >> Yours sincerely,
> >>>>> >> Charles Lee
> >>>>> >>
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> > --
> >>>>> > Yours sincerely,
> >>>>> > Charles Lee
> >>>>> >
> >>>>> >
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Yours sincerely,
> >>>> Charles Lee
> >>>>
> >>>
> >>
> >
>
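
For the Linux side of the problem quoted above (locale en_US.UTF-8 but a default of ISO8859-1), one common POSIX approach is to adopt the environment's locale with setlocale and then ask for its codeset with nl_langinfo(CODESET). The following is only a sketch under that assumption; the function name default_file_encoding is hypothetical, and this is not the code from the attached patch.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/* Copies the codeset name of the user's locale into buf and returns buf.
 * The exact string (for example "UTF-8") is whatever the C library
 * reports for the current environment. */
static const char *default_file_encoding(char *buf, size_t buf_len)
{
    const char *codeset;

    /* Adopt the locale named by LC_ALL / LC_CTYPE / LANG. If this fails,
     * the process simply stays in the "C" locale. */
    setlocale(LC_CTYPE, "");

    codeset = nl_langinfo(CODESET);
    if (codeset == NULL || codeset[0] == '\0') {
        /* Assumed fallback, matching the 8859-1 default described in the
         * thread. */
        codeset = "ISO-8859-1";
    }
    snprintf(buf, buf_len, "%s", codeset);
    return buf;
}
```

Note that setlocale changes process-wide state, so a VM would probably want to call this once during startup, and the returned name may still need normalizing to an alias that java.nio.charset accepts.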



-- 
Yours sincerely,
Charles Lee
