On Tue, Feb 11, 2020 at 4:27 PM Christopher Schultz <
ch...@christopherschultz.net> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> On 2/11/20 2:37 AM, Martin Grigorov wrote:
> > I guess you use Java 8. Newer versions of Java try UTF-8 first and
> > then fallback to ISO-8859-1:
> > https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/PropertyResourceBundle.html
>
> Correct, I am using Java 8:
>
> $ java -version
> openjdk version "1.8.0_232"
> OpenJDK Runtime Environment (build 1.8.0_232-8u232-b09-1~deb9u1-b09)
> OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)
>
> This is the version that Debian 9 provides. I could install a higher
> patch version, but would it help?
>
> On 2/11/20 6:38 AM, Mark Thomas wrote:
> > On 10/02/2020 20:58, Christopher Schultz wrote:
> >> All,
> >>
> >> I've recently begun making a change to my application's resource
> >> bundles, converting them into UTF-8 for readability and
> >> converting them to ISO-8859-1 during my build process to make
> >> ResourceBundle happy.
> >>
> >> I have everything working, except that Eclipse still thinks that
> >> my files ought to be ISO-8859-1 and ruins them when I load them.
> >> Sometimes, it's very obvious and that's not a problem: a
> >> developer will see that and fix it before continuing. But some
> >> files are only *slightly* broken by this and someone might make a
> >> mistake.
> >
> > I don't think we have seen this with Tomcat. Or have we (since we
> > switched to UTF-8)?
> >
> > The thing that bugged me was having to manually switch properties
> > files to UTF-8 to view them "properly". Your mail motivated me to
> > track down where I can change that in Eclipse:
> >
> > Window->Preferences->General->Content Types
> >
> > and I have changed Java properties files to use UTF-8. So that is
> > my personal niggle fixed. Thanks for the motivation.
>
> Yes, this *will* fix things, but:
>
> 1. It's a global setting, so it can't be set on a per-project basis.
> That means you have to be willing to convert ALL your properties files
> across ALL your projects to UTF-8. That may be okay for some people,
> but not all.
>
> 2. This is a guess: Tomcat's ide-eclipse ant target can't set that
> setting for the Tomcat project(s) because it's a global setting.
> Therefore, anyone using Eclipse as an IDE will have to manually set
> their content-type in order to NOT damage any of the files we ship.
>
> >> NOTE: We don't keep Eclipse settings in revision-control, so I
> >> can't modify everyone's Eclipse configuration. We are using svn
> >> and svn:mime-type is correctly set for these files; Eclipse just
> >> ignores that.
> >
> > I've seen that too. While I found it rather annoying, it wasn't
> > annoying enough to try and find a fix as that looked like it would
> > require patching Eclipse and/or the svn plug-in.
> >
> >> Anyway, I found that adding a UTF-8 BOM to the beginning of the
> >> file fixes that issue and Eclipse does the right thing.
> >
> > Ah. So Eclipse *is* doing content scanning. Interesting.
>
> Well, it's not really *content* scanning. But a BOM is the official
> way to tell the difference between a UTF-8 encoded file and one that
> just happens to have a whole bunch of valid UTF-8 byte sequences
> throughout (most of) the file.
>
> >> As a sanity check, I looked at how Tomcat's files are laid out
> >> and I don't see any BOMs.
> >
> > Correct. The only files in the code base that should have BOMs at
> > the moment are the ones in the test web application (under
> > bug49nnn) for testing the default Servlet's handling of files with
> > BOMs.
> >
> >> Should we add BOMs? Is there any reason NOT to use a BOM? These
> >> are file types that are officially supposed to be ISO-8859-1 but
> >> everyone wants to handle them differently, so I think adding BOMs
> >> might be a good idea so that editors are always informed of
> >> exactly what's happening.
> >>
> >> WDYT?
> >
> > I was concerned that adding a BOM would cause problems when
> > reading property files. I've seen reports of that with Java in the
> > past. A quick test suggests that the issue is no longer present
> > with the latest Java 8.
>
> I actually had another problem after I implemented all of this: any
> property file without a blank and/or comment line at the top ended up
> with a mangled and unusable *first* property key. A file like this:
>
> first.property=foo
> second.property=bar
>
> would end up like this after a trip through "native2ascii -encoding
> UTF-8":
>
> \ufefffirst.property=foo
> second.property=bar
>
> native2ascii stupidly interprets the UTF-8 BOM as an actual character,
> and encodes it in the output.
>
> This appears to be a bug in (at least old versions of) Java and/or
> native2ascii. I've got local installations of Java 8, 11 (Adopt), 11
> (Oracle), and 13 (OpenJDK), and only Java 8 has a "native2ascii"
> binary present. I see ant's <native2ascii> task has its own
> implementation, but it's probably very simple, just like the
> native2ascii program itself. Java's Reader classes incorrectly
> interpret the BOM as an actual character instead of an ignorable UTF-8
> control sequence.
>
> I can confirm that Java 13 still seems to have this problem: running
> ant's <native2ascii> under Java 13 still corrupts the first line of
> the file.
>
> Ensuring that the first line of the file is a comment or a blank line
> fixes things:
>
> # BOM
> first.property=foo
> second.property=bar
>
> becomes:
>
> \ufeff# BOM
> first.property=foo
> second.property=bar
>
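
One way to sidestep the mangled first key, independent of native2ascii,
is to drop a leading BOM before the text reaches the properties parser.
The following is only a minimal sketch of that idea: the class and
method names (BomSkip, skipBom, loadFromString) are made up for
illustration, and it assumes the file has already been decoded as UTF-8
into a Reader.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of Tomcat, Ant, or the JDK.
final class BomSkip {
    // U+FEFF is what a UTF-8 BOM (bytes EF BB BF) decodes to.
    private static final int BOM = '\uFEFF';

    // Returns a Reader that has consumed one leading BOM, if there was one.
    static Reader skipBom(Reader in) throws IOException {
        PushbackReader reader = new PushbackReader(in, 1);
        int first = reader.read();
        if (first != -1 && first != BOM) {
            reader.unread(first); // not a BOM: put the character back
        }
        return reader;
    }

    // Loads properties from already-decoded text, ignoring a leading BOM.
    static Properties load(Reader in) throws IOException {
        Properties props = new Properties();
        try (Reader clean = skipBom(in)) {
            props.load(clean);
        }
        return props;
    }

    // Convenience wrapper for in-memory text.
    static Properties loadFromString(String text) {
        try {
            return load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a real file, the Reader would typically be an InputStreamReader
created with StandardCharsets.UTF_8. With this in place, the first key
no longer depends on a comment or blank line at the top of the file.
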
> > With the use of POEditor and the import/export scripts we have, it
> > would be unusual for someone to be editing any of the property
> > files where UTF-8 vs ISO-8859-1 matters. Thinking about it a little
> > more, there would be a need to do this to edit non-English strings
> > in the older branches where the key doesn't exist in the latest
> > code. That strikes me as a fairly rare use case.
> >
> > My other worry is that some editors will fail to handle the BOM
> > correctly and we'll end up causing more issues than we solve. I've
> > little basis for that worry other than (possibly out of date)
> > experience.
> >
> > Overall, I guess I am -0 on adding BOMs.
>
> Okay. This is a fairly recent change to Tomcat, and frankly, we (a)
> don't get a huge number of outside contributions which include changes
> to the localized properties files (except for the translation-only
> contributions, which have been great!) and (b) often ignore the
> non-English translations in the first place because we are lazy.
>
> I think maybe this can stay on the back-burner until we see if we end
> up with any problems.
>
> Does/can "checkstyle" check for valid UTF-8 byte sequences in
> .properties files? I think that may be a helpful check to add if it's
> not already in there.
>
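
Whether checkstyle itself can do this is left open here. As a
standalone alternative, a strict UTF-8 check is a few lines of plain
Java. This is a sketch under the assumption that a build step or test
would call it for each .properties file; the class name Utf8Check is
hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not an existing checkstyle check.
final class Utf8Check {
    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                // REPORT makes the decoder throw instead of silently
                // substituting U+FFFD for bad input.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Feeding it the result of Files.readAllBytes(path) would flag a file
that an editor saved back as ISO-8859-1, as long as the file contains
at least one non-ASCII character. Pure ASCII content is valid in both
encodings and passes either way.
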

Just to add: I am a happy user of XML-based properties files (available
since Java 1.5).
It is relatively simple to roll your own XmlPropertyResourceBundle, e.g.
https://gist.github.com/asicfr/1b76ea60029264d7be15d019a866e1a4
This should sidestep the encoding issues, since the XML file declares
its own encoding.
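
For illustration, a bundle backed by Properties.loadFromXML can be
sketched roughly as follows. This is not necessarily what the linked
gist does; the class name XmlBundle and the fromString helper are
invented for this example, and wiring it into
ResourceBundle.getBundle (via a custom ResourceBundle.Control) is left
out.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Enumeration;
import java.util.HashSet;
import java.util.Properties;
import java.util.ResourceBundle;
import java.util.Set;

// Hypothetical sketch of an XML-backed resource bundle.
final class XmlBundle extends ResourceBundle {
    private final Properties props = new Properties();

    // The stream must contain a document in the java.util.Properties XML
    // format; its encoding is taken from the XML declaration.
    XmlBundle(InputStream in) throws IOException {
        props.loadFromXML(in);
    }

    @Override
    protected Object handleGetObject(String key) {
        return props.getProperty(key); // null tells ResourceBundle to ask the parent
    }

    @Override
    public Enumeration<String> getKeys() {
        Set<String> keys = new HashSet<>(props.stringPropertyNames());
        if (parent != null) {
            keys.addAll(Collections.list(parent.getKeys()));
        }
        return Collections.enumeration(keys);
    }

    // Convenience factory for in-memory XML, encoded as UTF-8.
    static XmlBundle fromString(String xml) {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            return new XmlBundle(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the encoding travels inside the file, neither a BOM nor a
native2ascii step is needed for non-ASCII translations.
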


>
> - -chris
> -----BEGIN PGP SIGNATURE-----
> Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/
>
> iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAl5CuasACgkQHPApP6U8
> pFgKWBAAuQiF6fMD+LWDPkdiCWRIYPzPIPjSqHIOvn6iORC/RnJ2S2s8tsvu0K6E
> IVypbd016lOP5Mn1hLGNU80eYPo3xNzz8GrZgjXImG+xeFcZ0VL+FGCkpsE6UrlT
> LuxHi7Axq+sRhxf/iEuTxr/vS9sD5ggc5oc/TnVR1b1NETRX0M43uQFqoraOtHUE
> mCW6KgzqteEu8ca00YH8k73eeCOhIUybFdTXBBaf5VgxT+uQhM0ogIUFkls0KbSE
> sq+SCzIlb1ftSVI1Dp4ORRTH6sjaiBnboZLduJaBbyiqHCIBAwnyO++Qk3RBaWCS
> 4SoOfVF0LFGS5CRG/IZcKMhNctS/NzCa5ShsTFGhaDxqhn+CaaMq9jJlhNb7j1vG
> La/+cSYSp9h63ZohMh5M2r9FbT3nP3q6Tt7N2X40ALGxpMReSf4zF/lV9feHT9wM
> Yq4u6sPO7ACHfL+a4FST1jNPYeLJ4PfiSSv6LY663VZOg06JlVnT0P0SxWKvm7r8
> Y38Guw0m75jWPhM1s0wNGYvQ8t2rCMvjpIIedptmuk9IGyfBux20ms9RGjiir1wB
> BEdL/0opnJALG3qx1ver+vqfWMJbXpyUCnCPgVCPCtnprmSYrdpaif2hiGcIEqG+
> Q5aS3KPvmXN722ORgSXpRn/5Lym2dznMH2alRLbo/Gz/z3g2k4w=
> =T4mh
> -----END PGP SIGNATURE-----
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
> For additional commands, e-mail: dev-h...@tomcat.apache.org
>
>
