Thanks Nathan! I will try this :-)
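For reference, here is a minimal sketch in C of the mapping idea discussed below (call GetACP, then translate the identifier into a charset name). The function name charset_for_codepage and the table are hypothetical, not existing Harmony or APR code. The names follow the MSDN code page identifiers table, and whether each one resolves in java.nio.charset is an assumption that still needs checking.

```c
#include <stddef.h>
#include <stdio.h>

/* One row of the (hypothetical) lookup table: code page id -> charset name. */
struct cp_entry {
    unsigned int id;
    const char *name;
};

/* Names follow the MSDN code page identifiers table. Whether each name
 * resolves in java.nio.charset is an assumption that must be verified. */
static const struct cp_entry cp_table[] = {
    {   932, "shift_jis"    },
    {   936, "gb2312"       },
    {   950, "big5"         },
    {  1252, "windows-1252" },
    { 28591, "iso-8859-1"   },
    { 65001, "utf-8"        },
};

/* Return a charset name for a Windows code page id (e.g. the value of
 * GetACP()). Unknown ids fall back to "CP<id>" written into buf, the same
 * shape as the string APR builds on Windows. */
const char *charset_for_codepage(unsigned int id, char *buf, size_t buflen)
{
    size_t i;
    for (i = 0; i < sizeof cp_table / sizeof cp_table[0]; i++) {
        if (cp_table[i].id == id) {
            return cp_table[i].name;
        }
    }
    snprintf(buf, buflen, "CP%u", id);
    return buf;
}
```

On Windows the call would be charset_for_codepage(GetACP(), buf, sizeof buf). With 936 it returns "gb2312", and an id that is not in the table, such as 437, comes back as "CP437". The table only covers a handful of code pages, so the fallback matters.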
On Fri, Jul 17, 2009 at 10:05 AM, Nathan Beyer <ndbe...@apache.org> wrote:

> On Thu, Jul 16, 2009 at 8:50 PM, Nathan Beyer<ndbe...@apache.org> wrote:
> > On Thu, Jul 16, 2009 at 8:35 PM, Nathan Beyer<ndbe...@apache.org> wrote:
> >> On Thu, Jul 16, 2009 at 8:26 PM, Nathan Beyer<ndbe...@apache.org> wrote:
> >>> On Thu, Jul 16, 2009 at 8:18 PM, Charles Lee<littlee1...@gmail.com> wrote:
> >>>> Hi Nathan,
> >>>>
> >>>> What I got is 936, the code page identifier. Is there an API for us to
> >>>> map 936 to gb2312?
> >>>
> >>> Oh, the 'identifier' bit was missing - yeah, we'll need to translate
> >>> that into a name of some sort. I'll poke around a bit and see what I
> >>> can find.
> >>
> >> We'll probably just have to put in a mapping ourselves based on the
> >> documentation. We'd call GetACP [1] and map that to a known alias in
> >> java.nio.charset that matches the definitions [2] of the identifiers.
> >>
> >> [1] http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx
> >> [2] http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
> >
> > This may be better - APR has a function for getting the OS default
> > encoding. This would work across all platforms that APR supports and I
> > believe we already use APR.
> >
> > http://apr.apache.org/docs/apr/1.3/group__apr__portabile.html#g6e21845a4a5f3b7dd107b2beea50c91e
>
> However, the Windows version of this is simply - return
> apr_psprintf(pool, "CP%u", (unsigned) GetACP());. Which is essentially
> "CP" + codePageId.
>
> And the Unix version of this method doesn't look very good for our
> purposes.
>
> > -Nathan
> >>
> >>>
> >>>> If we put 936 in the file.encoding, can we successfully get the
> >>>> encoder and decoder by charset?
> >>>>
> >>>> On Fri, Jul 17, 2009 at 9:05 AM, Nathan Beyer <ndbe...@apache.org> wrote:
> >>>>
> >>>>> On Thu, Jul 16, 2009 at 1:28 AM, Charles Lee<littlee1...@gmail.com> wrote:
> >>>>> > Hi guys,
> >>>>> >
> >>>>> > I have added the locale function in the drlvm, the patch is attached.
> >>>>> > Please try this new patch on Linux.
> >>>>> >
> >>>>> > The patch should work on Linux but fail on Windows, because Windows
> >>>>> > returns a code page, not a charset, from setlocale.
> >>>>>
> >>>>> Code page and character set are the same thing. We shouldn't need to
> >>>>> convert it as the Charset APIs will have to support the values anyway.
> >>>>>
> >>>>> What's the value you're getting? If it's 'Cp1252', then we're good, as
> >>>>> that's just an alias for 'Windows-1252' (or vice-versa).
> >>>>>
> >>>>> -Nathan
> >>>>>
> >>>>> > I have tried for a long time to
> >>>>> > get the charset name from the code page, for example:
> >>>>> > CPINFOEX cpInfoEx;
> >>>>> > BOOL iReturn = GetCPInfoEx(CP_ACP, 0, &cpInfoEx);
> >>>>> > if (iReturn > 0) {
> >>>>> >     printf("FULL NAME %s\n", cpInfoEx.CodePageName);
> >>>>> > }
> >>>>> > But I only get the full name without any format.
> >>>>> >
> >>>>> > There is a code page identifiers map in the MSDN, detail here. I may
> >>>>> > hard-code this map in the file. But the note on the MSDN says:
> >>>>> > "ANSI code pages can be different on different computers, or can be
> >>>>> > changed for a single computer, leading to data corruption. For the most
> >>>>> > consistent results, applications should use Unicode, such as UTF-8 or
> >>>>> > UTF-16, instead of a specific code page."
> >>>>> > I am afraid hard-coding will fail on some machines. (By the way, this
> >>>>> > seems to suggest UTF-8 as the default again :-)
> >>>>> >
> >>>>> > There is also a class Encoding in the VC++, detail here. But we can not
> >>>>> > use it here.
> >>>>> >
> >>>>> > So does anyone know something about locales on Windows?
> >>>>> > Again, shall we use UTF-8 as our default?
> >>>>> >
> >>>>> > On Wed, Jul 15, 2009 at 2:12 PM, Charles Lee <littlee1...@gmail.com> wrote:
> >>>>> >>
> >>>>> >> It seems we should add it in the drlvm.
> >>>>> >>
> >>>>> >> On Wed, Jul 15, 2009 at 1:58 PM, Regis <xu.re...@gmail.com> wrote:
> >>>>> >>>
> >>>>> >>> Nathan Beyer wrote:
> >>>>> >>>>
> >>>>> >>>> Is the IBM VME dealing with this correctly? Do we just need to fix
> >>>>> >>>> DRLVM?
> >>>>> >>>
> >>>>> >>> Yes, I only tested on Linux, IBM VME set the property correctly.
> >>>>> >>>
> >>>>> >>>>
> >>>>> >>>> On Wed, Jul 15, 2009 at 12:25 AM, Regis<xu.re...@gmail.com> wrote:
> >>>>> >>>>>
> >>>>> >>>>> Kevin Zhou wrote:
> >>>>> >>>>>>
> >>>>> >>>>>> Yea, from luniglob.c, CL attempts to read the "file.encoding"
> >>>>> >>>>>> property from the VM but fails to get the correct encoding.
> >>>>> >>>>>>
> >>>>> >>>>>> Regis, do you know any other specific ways that CL can get the
> >>>>> >>>>>> right property?
> >>>>> >>>>>
> >>>>> >>>>> We can get it from the OS directly. Maybe just read env variables on Linux?
> >>>>> >>>>>
> >>>>> >>>>>> Wed, Jul 15, 2009 at 9:59 AM, Regis <xu.re...@gmail.com> wrote:
> >>>>> >>>>>>
> >>>>> >>>>>>> Charles Lee wrote:
> >>>>> >>>>>>>
> >>>>> >>>>>>>> Hi Nathan,
> >>>>> >>>>>>>>
> >>>>> >>>>>>>> If the file encoding derives from the OS, there must be some bug
> >>>>> >>>>>>>> in it, because on my Linux machine the locale is en_US.UTF-8 and
> >>>>> >>>>>>>> our default codec is still ISO8859-1. Do you know where we can
> >>>>> >>>>>>>> find that code?
> >>>>> >>>>>>>>
> >>>>> >>>>>>> Classlib expected the VM to do this and set the property, but it
> >>>>> >>>>>>> didn't, so we have to do this by ourselves.
> >>>>> >>>>>>>
> >>>>> >>>>>>>> On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer <nbe...@gmail.com> wrote:
> >>>>> >>>>>>>>
> >>>>> >>>>>>>>> Are we talking about Windows or Linux? The default file encoding
> >>>>> >>>>>>>>> should derive from the OS. I believe that's defined by the specs.
> >>>>> >>>>>>>>>
> >>>>> >>>>>>>>> Sent from my iPhone
> >>>>> >>>>>>>>>
> >>>>> >>>>>>>>> On Jul 14, 2009, at 5:51 AM, Charles Lee <littlee1...@gmail.com> wrote:
> >>>>> >>>>>>>>>
> >>>>> >>>>>>>>>> On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv <firep...@gmail.com> wrote:
> >>>>> >>>>>>>>>>
> >>>>> >>>>>>>>>>> Hi,
> >>>>> >>>>>>>>>>>
> >>>>> >>>>>>>>>>> Charles, I believe UTF-8 is the default encoding for RI, and it
> >>>>> >>>>>>>>>>> sounds reasonable.
> >>>>> >>>>>>>>>>> BTW, it may cause some compatibility problems, maybe we need to
> >>>>> >>>>>>>>>>> run more tests to verify?
> >>>>> >>>>>>>>>>>
> >>>>> >>>>>>>>>>> 2009/7/14 Charles Lee <littlee1...@gmail.com>
> >>>>> >>>>>>>>>>>
> >>>>> >>>>>>>>>>>> Hi guys:
> >>>>> >>>>>>>>>>>>
> >>>>> >>>>>>>>>>>> I am running some of the ant junit test cases and hitting some
> >>>>> >>>>>>>>>>>> encoding problems. I think they may be caused by the different
> >>>>> >>>>>>>>>>>> default encodings of RI and harmony. My locale is en_US.UTF-8,
> >>>>> >>>>>>>>>>>> the RI default is UTF-8 but harmony's is 8859-1.
> >>>>> >>>>>>>>>>>> And then I encountered
> >>>>> >>>>>>>>>>>> HARMONY-3736 <https://issues.apache.org/jira/browse/HARMONY-3736>,
> >>>>> >>>>>>>>>>>> and the two diffs attached on that issue. It seems we always get
> >>>>> >>>>>>>>>>>> 8859-1, because (correct me if wrong :-):
> >>>>> >>>>>>>>>>>>
> >>>>> >>>>>>>>>>>> 1. we removed the set code in the vm, so we will always get null
> >>>>> >>>>>>>>>>>> if we call the vm method
> >>>>> >>>>>>>>>>>> 2. we set file.encoding in the libglob.c; if we got null from
> >>>>> >>>>>>>>>>>> the vm, we set 8859-1.
> >>>>> >>>>>>>>>>>
> >>>>> >>>>>>>>>>> Sorry, it should be luniglob.c
> >>>>> >>>>>>>>>>>
> >>>>> >>>>>>>>>>>> 3. we can not set file.encoding at run time.
> >>>>> >>>>>>>>>>>>
> >>>>> >>>>>>>>>>>> ant uses UTF-8 to encode filenames which contain non-ASCII
> >>>>> >>>>>>>>>>>> characters.
> >>>>> >>>>>>>>>>>> So why do we use iso8859-1 as our unchangeable default?
> >>>>> >>>>>>>>>>>> The wiki http://en.wikipedia.org/wiki/ISO8859-1 says "In
> >>>>> >>>>>>>>>>>> computing applications, encodings that provide full UCS support
> >>>>> >>>>>>>>>>>> (such as UTF-8 <http://en.wikipedia.org/wiki/UTF-8> and
> >>>>> >>>>>>>>>>>> UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding
> >>>>> >>>>>>>>>>>> increasing favor over encodings based on ISO 8859-1." Should we
> >>>>> >>>>>>>>>>>> simply change iso8859-1 to utf-8?
> >>>>> >>>>>>>>>>>>
> >>>>> >>>>>>>>>>>> --
> >>>>> >>>>>>>>>>>> Yours sincerely,
> >>>>> >>>>>>>>>>>> Charles Lee
> >>>>> >>>>>>>>>>>>
> >>>>> >>>>>>>>>>> --
> >>>>> >>>>>>>>>>>
> >>>>> >>>>>>>>>>> Best Regards!
> >>>>> >>>>>>>>>>>
> >>>>> >>>>>>>>>>> Jimmy, Jing Lv
> >>>>> >>>>>>>>>>> China Software Development Lab, IBM
> >>>>> >>>>>>>>>>>
> >>>>> >>>>>>>>>> --
> >>>>> >>>>>>>>>> Yours sincerely,
> >>>>> >>>>>>>>>> Charles Lee
> >>>>> >>>>>>>>>>
> >>>>> >>>>>>> --
> >>>>> >>>>>>> Best Regards,
> >>>>> >>>>>>> Regis.
> >>>>> >>>>>>>
> >>>>> >>>>> --
> >>>>> >>>>> Best Regards,
> >>>>> >>>>> Regis.
> >>>>> >>>>>
> >>>>> >>> --
> >>>>> >>> Best Regards,
> >>>>> >>> Regis.
> >>>>> >>
> >>>>> >> --
> >>>>> >> Yours sincerely,
> >>>>> >> Charles Lee
> >>>>> >>
> >>>>> > --
> >>>>> > Yours sincerely,
> >>>>> > Charles Lee
> >>>>> >
> >>>>
> >>>> --
> >>>> Yours sincerely,
> >>>> Charles Lee
> >>>>
>

--
Yours sincerely,
Charles Lee