Re: Insufficiencies in JEP: 400: UTF-8 by Default

2021-03-31 Thread Roger Riggs

Hi Anthony,

A draft of updates to the Process API is in the works and covers improving
the ease of use and providing Readers and Writers.  Note that if process
output is redirected to a file, it does not interpose on the byte streams
and is not in a position to affect the character set used by the child
process.

Regards, Roger


On 3/30/21 1:03 PM, Anthony Vanelverdinghe wrote:

Hi Alan

As Marco mentioned, another use case is sub-process stdin/stdout/stderr. In my 
particular instance, I'm starting a Process which has its output redirected to 
a file. It uses the platform's default encoding for writing to stdout. So when 
I want to read its output from the file at some later point, I need to supply 
that encoding to the Files API.
One way to accommodate this use case is a method that retrieves the
platform's default encoding, for example a method `platformEncoding` in Charset or
Process, or the `Console::charset` method you mentioned. Another option would be to
enhance the Process API by adding methods to Process which return appropriate
Readers/Writers and adding methods of the form `redirectX(File file, Charset
encoding)` to ProcessBuilder. But this seems like a lot of additional API surface
just to avoid surfacing the platform's default encoding itself.
So I think the JEP should specify how it'll address use cases w.r.t. the 
Process API, shouldn't it?

Kind regards,
Anthony
  
On Sunday, March 14, 2021 13:01 CET, Alan Bateman wrote:
  

On 14/03/2021 11:00, Marco wrote:

:

IMO Charset should provide standardized getters for the OS charset and the
console charset. The latter being different has been a long-standing issue on
Windows where the codepage differs between its CLI and regular environments.
OpenJDK has the necessary data already available in its custom system
properties.

The console charset is currently hidden behind PrintStream not exposing the
underlying OutputStreamWriter and not offering getEncoding() itself. The OS
charset would be hidden in the future by the specification change to
Charset.defaultCharset() in JEP 400.

The intention is that there will be little or no impact to the console
streams. This means that java.io.Console reader/writer methods should
continue to return a Reader/PrintWriter that uses the platform encoding
(or code page on Windows). Same thing for the System.out/System.err
print streams. We need to make this clearer in the JEP.

There has been discussion on this mailing list about adding a
Console::charset method but it didn't come to a consensus. Naoto Sato
and I have been chatting about it again recently as there may be a need
to add an API in advance of proposing to target the JEP.

One case that we are still mulling over is code that creates an
InputStreamReader on System.in without specifying the charset. This
might be older code that pre-dates java.io.Console or maybe code that
wasn't tested on a wide range of platforms. Options range from a spec
change to doing nothing (the latter meaning running with "COMPAT" or
migrating the code to use the 2-arg constructor, as the default charset
is not the right choice).

-Alan







Re: Insufficiencies in JEP: 400: UTF-8 by Default

2021-03-30 Thread Anthony Vanelverdinghe
Hi Alan

As Marco mentioned, another use case is sub-process stdin/stdout/stderr. In my 
particular instance, I'm starting a Process which has its output redirected to 
a file. It uses the platform's default encoding for writing to stdout. So when 
I want to read its output from the file at some later point, I need to supply 
that encoding to the Files API.
One way to accommodate this use case is a method that retrieves the
platform's default encoding, for example a method `platformEncoding` in Charset
or Process, or the `Console::charset` method you mentioned. Another option
would be to enhance the Process API by adding methods to Process which return
appropriate Readers/Writers and adding methods of the form `redirectX(File file,
Charset encoding)` to ProcessBuilder. But this seems like a lot of additional
API surface just to avoid surfacing the platform's default encoding itself.
So I think the JEP should specify how it'll address use cases w.r.t. the 
Process API, shouldn't it?
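
A minimal sketch of the use case described above, for illustration only. It
assumes the child process wrote the file in the platform's native encoding and
that this encoding can be read from the `native.encoding` system property. That
property is only available on JDK 17 and later, which postdates this message,
so the sketch falls back to `Charset.defaultCharset()` when it is absent. The
class and method names are hypothetical and are not part of any proposed API.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;
import java.nio.file.Files;
import java.nio.file.Path;

final class RedirectedOutput {

    // Resolve a charset name, falling back when it is missing or unsupported.
    static Charset resolve(String name, Charset fallback) {
        if (name == null || name.isEmpty()) {
            return fallback;
        }
        try {
            return Charset.forName(name);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback;
        }
    }

    // Assumption: "native.encoding" names the platform encoding (JDK 17+).
    static Charset platformEncoding() {
        return resolve(System.getProperty("native.encoding"), Charset.defaultCharset());
    }

    // Configure (but do not start) a child whose stdout goes straight to a file.
    // The parent never sees these bytes, so it cannot influence their encoding.
    static ProcessBuilder redirected(Path file, String... command) {
        return new ProcessBuilder(command).redirectOutput(file.toFile());
    }

    // Decode a file that a child process wrote via redirected stdout.
    static String read(Path file, Charset encoding) throws IOException {
        return new String(Files.readAllBytes(file), encoding);
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("child", ".log");
        // Stand-in for the child process: write bytes in the platform encoding.
        Files.write(out, "hello".getBytes(platformEncoding()));
        System.out.println(read(out, platformEncoding()));
        Files.delete(out);
    }
}
```

The point of the sketch is that the reader has to supply the charset
explicitly; relying on the charset-less Files methods would decode the file as
UTF-8 regardless of what the child process wrote.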

Kind regards,
Anthony

On Sunday, March 14, 2021 13:01 CET, Alan Bateman wrote:

> On 14/03/2021 11:00, Marco wrote:
> > :
> >
> > IMO Charset should provide standardized getters for the OS charset and the
> > console charset. The latter being different has been a long-standing issue
> > on Windows where the codepage differs between its CLI and regular
> > environments. OpenJDK has the necessary data already available in its custom
> > system properties.
> >
> > The console charset is currently hidden behind PrintStream not exposing the
> > underlying OutputStreamWriter and not offering getEncoding() itself. The OS
> > charset would be hidden in the future by the specification change to
> > Charset.defaultCharset() in JEP 400.
> The intention is that there will be little or no impact to the console
> streams. This means that java.io.Console reader/writer methods should
> continue to return a Reader/PrintWriter that uses the platform encoding
> (or code page on Windows). Same thing for the System.out/System.err
> print streams. We need to make this clearer in the JEP.
>
> There has been discussion on this mailing list about adding a
> Console::charset method but it didn't come to a consensus. Naoto Sato
> and I have been chatting about it again recently as there may be a need
> to add an API in advance of proposing to target the JEP.
>
> One case that we are still mulling over is code that creates an
> InputStreamReader on System.in without specifying the charset. This
> might be older code that pre-dates java.io.Console or maybe code that
> wasn't tested on a wide range of platforms. Options range from a spec
> change to doing nothing (the latter meaning running with "COMPAT" or
> migrating the code to use the 2-arg constructor, as the default charset
> is not the right choice).
>
> -Alan
>
>
>



Insufficiencies in JEP: 400: UTF-8 by Default

2021-03-14 Thread Marco
Hi all,

the JEP generally paints the picture that using the OS charset would be
incorrect or useless. It is, however, the right and perfectly valid choice for
communicating with other local programs where no other charset was specified.
It is the same as UTF-8 most of the time, but not always, and especially not on
Windows; using UTF-8 every time would be strictly less correct.

Per [1], LC_CTYPE defines the charset to use for transforming between binary
data and text. Given that the file.encoding system property doesn't exist
within Java SE, LC_CTYPE combined with the current specification of
Charset.defaultCharset() is the only compliant way to change the default
charset in Java SE outside some custom application-specific handling. Ignoring
LC_CTYPE obviously leaves no standard approach. From the program's POV the
same applies in reverse: currently one can only use Charset.defaultCharset()
to determine the OS charset, or let the java.io methods infer it through the
charset-less constructors and then potentially read it back through e.g.
InputStreamReader.getEncoding().
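
The two approaches named in the paragraph above can be sketched as follows.
This is an illustration rather than a recommendation, and the class and method
names are hypothetical. Note that once the specification of
Charset.defaultCharset() changes as proposed in JEP 400, both calls would report
UTF-8 rather than the OS charset, which is the gap being described.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

final class InferredCharset {

    // Let the charset-less constructor pick the default, then read its name back.
    // getEncoding() returns the historical name, which may differ from
    // Charset.name() (for example "UTF8" versus "UTF-8").
    static String inferredEncodingName() {
        InputStreamReader reader =
            new InputStreamReader(new ByteArrayInputStream(new byte[0]));
        return reader.getEncoding();
    }

    public static void main(String[] args) {
        System.out.println("Charset.defaultCharset(): " + Charset.defaultCharset());
        System.out.println("InputStreamReader.getEncoding(): " + inferredEncodingName());
    }
}
```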

The OS charset is still relevant for text interaction on System.in/out/err, 
sub-process stdin/stdout/stderr and files with unknown encoding. Programs like 
grep assume the files are encoded according to LC_CTYPE, much like a similarly 
designed Java program that uses the OS charset on purpose. Constructing a 
Reader for stdin properly requires some way to determine the relevant OS 
encoding.

I'm perfectly happy with changing the charset-less methods to use UTF-8 since
it's the best choice outside the above scenarios, despite the compatibility
impact. Dropping standardized support for the OS charset, however, not only
breaks the above interactions but also leaves no nice migration path. The
-Dfile.encoding=COMPAT workaround is explicitly not standardized and isn't
available to the Java application itself, only to whoever starts the JVM,
presumably to work around outdated code.

IMO Charset should provide standardized getters for the OS charset and the
console charset. The latter being different has been a long-standing issue on
Windows where the codepage differs between its CLI and regular environments.
OpenJDK has the necessary data already available in its custom system
properties.

The console charset is currently hidden behind PrintStream not exposing the
underlying OutputStreamWriter and not offering getEncoding() itself. The OS
charset would be hidden in the future by the specification change to
Charset.defaultCharset() in JEP 400.

Please consider the above minor additions to fix those issues for good.

Best regards,

Marco

[1] https://pubs.opengroup.org/onlinepubs/7908799/xbd/envvar.html




Re: Insufficiencies in JEP: 400: UTF-8 by Default

2021-03-14 Thread Alan Bateman

On 14/03/2021 11:00, Marco wrote:

:

IMO Charset should provide standardized getters for the OS charset and the
console charset. The latter being different has been a long-standing issue on
Windows where the codepage differs between its CLI and regular environments.
OpenJDK has the necessary data already available in its custom system
properties.

The console charset is currently hidden behind PrintStream not exposing the
underlying OutputStreamWriter and not offering getEncoding() itself. The OS
charset would be hidden in the future by the specification change to
Charset.defaultCharset() in JEP 400.

The intention is that there will be little or no impact to the console
streams. This means that java.io.Console reader/writer methods should
continue to return a Reader/PrintWriter that uses the platform encoding
(or code page on Windows). Same thing for the System.out/System.err
print streams. We need to make this clearer in the JEP.


There has been discussion on this mailing list about adding a 
Console::charset method but it didn't come to a consensus. Naoto Sato 
and I have been chatting about it again recently as there may be a need 
to add an API in advance of proposing to target the JEP.


One case that we are still mulling over is code that creates an
InputStreamReader on System.in without specifying the charset. This
might be older code that pre-dates java.io.Console or maybe code that
wasn't tested on a wide range of platforms. Options range from a spec
change to doing nothing (the latter meaning running with "COMPAT" or
migrating the code to use the 2-arg constructor, as the default charset
is not the right choice).
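
A rough sketch of the migration mentioned above, assuming the application
already knows which charset its input uses. The class and method names are
hypothetical, and how the charset is obtained is deliberately left to the
caller because that is the open question in this thread.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitStdinCharset {

    // Before: new InputStreamReader(System.in) silently uses the default charset.
    // After: the 2-arg constructor makes the choice explicit.
    static BufferedReader stdinReader(Charset charset) {
        return new BufferedReader(new InputStreamReader(System.in, charset));
    }

    // Read the first line of any stream using an explicitly supplied charset.
    static String firstLine(InputStream in, Charset charset) {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
        try {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for System.in so the demo does not block on input.
        byte[] input = "caf\u00e9\n".getBytes(StandardCharsets.ISO_8859_1);
        String line = firstLine(new ByteArrayInputStream(input), StandardCharsets.ISO_8859_1);
        System.out.println(line);
    }
}
```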


-Alan