Re: UTF-8 properties files and BOMs

Mark Thomas Tue, 11 Feb 2020 06:48:12 -0800

On 11/02/2020 14:26, Christopher Schultz wrote:

<snip/>


>> The thing that bugged me was having to manually switch properties
>> files to UTF-8 to view them "properly". Your mail motivated me to
>> track down where I can change that in Eclipse:
> 
>> Window->Preferences->General->Content Types
> 
>> and I have changed Java properties files to use UTF-8. So that is
>> my personal niggle fixed. Thanks for the motivation.
> 
> Yes, this *will* fix things, but:
> 
> 1. It's a global setting, so it can't be set on a per-project basis.
> That means you have to be willing to convert ALL your properties files
> across ALL your projects to UTF-8. That may be okay for some people,
> but not all.

Fair point.

> 2. This is a guess: Tomcat's ide-eclipse ant target can't set that
> setting for the Tomcat project(s) because it's a global setting.
> Therefore, anyone using Eclipse as an IDE will have to manually set
> their content-type in order to NOT damage any of the files we ship.

I'm not sure about actual damage. I've seen Eclipse manipulate UTF-8
files while configured to use ISO-8859-1 without issue. But maybe that
is actually git doing UTF-8 manipulation.

>> I was concerned that adding a BOM would cause problems when
>> reading property files. I've seen reports of that with Java in the
>> past. A quick test suggests that the issue is no longer present
>> with latest Java 8.
> 
> I actually had another problem after I implemented all of this: any
> property file without a blank and/or comment line at the top ended up
> with a mangled and unusable *first* property key. A file like this:
> 
> first.property=foo
> second.property=bar
> 
> Would end up like this after a trip through "native2ascii -encoding
> UTF-8":
> 
> \ufefffirst.property=foo
> second.property=bar

That is similar to the problems I recall with earlier versions of Java.

> native2ascii stupidly interprets the UTF-8 BOM as an actual character,
> and encodes it in the output.
> 
> This appears to be a bug in (at least old versions of) Java and/or
> native2ascii. I've got local installations of Java 8, 11 (Adopt), 11
> (Oracle), and 13 (OpenJDK), and only Java 8 has a "native2ascii"
> binary present. I see ant's <native2ascii> task has its own
> implementation, but it's probably very simple, just like the
> native2ascii program itself. Java's Reader classes incorrectly
> interpret the BOM as an actual character instead of an ignorable UTF-8
> control sequence.

But the chances of us being able to "fix" the Ant implementation are
considerably higher :).
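
For illustration, a minimal sketch of the Reader behaviour described
above and of one way a reader could skip a leading BOM. This is not
Ant's actual code: the class name BomSkippingReader and its open()
method are hypothetical. It only assumes that Java's UTF-8 decoder
hands U+FEFF through as an ordinary character.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Ant or the JDK.
public class BomSkippingReader {

    private static final int BOM = 0xFEFF;

    // Wraps a UTF-8 stream in a Reader that drops a single leading U+FEFF.
    public static Reader open(InputStream in) throws IOException {
        PushbackReader reader = new PushbackReader(
                new InputStreamReader(in, StandardCharsets.UTF_8), 1);
        int first = reader.read();
        if (first != -1 && first != BOM) {
            reader.unread(first); // not a BOM, so put the character back
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        // EF BB BF followed by the first line of a properties file.
        byte[] bytes = "\uFEFFfirst.property=foo\n".getBytes(StandardCharsets.UTF_8);

        // Plain InputStreamReader: the BOM comes back as the first character.
        try (Reader plain = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            System.out.println(Integer.toHexString(plain.read()));
        }

        // Wrapped reader: the first character is the 'f' of "first.property".
        try (Reader skipping = open(new ByteArrayInputStream(bytes))) {
            System.out.println((char) skipping.read());
        }
    }
}
```

Only the first character is inspected, so a U+FEFF that appears later
in the file is left alone.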

> Ensuring that the first line of the file is a comment or a blank line
> fixes things:
> 
> # BOM
> first.property=foo
> second.property=bar
> 
> becomes:
> 
> \ufeff# BOM
> first.property=foo
> second.property=bar

Does the BOM end up creating an additional property in this case?
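
One way to check is a small sketch like the following, which feeds the
converted text quoted above to java.util.Properties. The class name
BomCommentCheck is made up for the example. Going by the documented
Properties.load() rules, the first line no longer starts with '#', so
it should be parsed as a key/value pair (key U+FEFF followed by '#',
value "BOM") rather than as a comment.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical test class, only for poking at Properties.load().
public class BomCommentCheck {

    // Parses properties text exactly as Properties.load(Reader) would.
    public static Properties load(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    public static void main(String[] args) throws IOException {
        // The converted file from the quoted mail. The backslash is doubled
        // so that the escape reaches Properties as six literal characters.
        Properties props = load(
                "\\ufeff# BOM\nfirst.property=foo\nsecond.property=bar\n");

        // Number of entries: three if the first line became a property.
        System.out.println(props.size());

        // The real keys should be unaffected by the extra entry.
        System.out.println(props.getProperty("first.property"));

        // Look up the suspected extra key: U+FEFF followed by '#'.
        System.out.println(props.getProperty("\uFEFF#"));
    }
}
```

If that holds, the comment line protects the first real key at the cost
of one junk entry that nothing should ever look up.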

>> Overall, I guess I am -0 on adding BOMs.
> 
> Okay. This is a fairly recent change to Tomcat, and frankly, we (a)
> don't get a huge number of outside contributions which include changes
> to the localized properties files (except for the translation-only
> contributions, which have been great!) and (b) often ignore the
> non-English translations in the first place because we are lazy.
> 
> I think maybe this can stay on the back-burner until we see if we end
> up with any problems.

Sounds reasonable to me. It looks like we have options if we need them
but with a few minor issues to research / iron out first if we go that way.

> Does/can "checkstyle" check for valid UTF-8 byte sequences in
> .properties files? I think that may be a helpful check to add if it's
> not already in there.

Don't know. +1 if such a thing exists.
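
Failing that, the check itself is small. A minimal standalone sketch,
not a checkstyle module: the class name Utf8Check and the idea of
passing file names on the command line are assumptions made for the
example. It uses a strict CharsetDecoder so that malformed input is
reported instead of being silently replaced.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical standalone checker, not an existing checkstyle check.
public class Utf8Check {

    // Returns true only if the bytes decode as UTF-8 without any error.
    public static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Usage: java Utf8Check file1.properties file2.properties ...
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            Path file = Paths.get(name);
            if (!isValidUtf8(Files.readAllBytes(file))) {
                System.err.println("Invalid UTF-8: " + file);
            }
        }
    }
}
```

Pure ASCII files pass trivially. An ISO-8859-1 file with accented
characters will normally fail, which is the case worth catching.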

Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org
