On Tue, Feb 11, 2020 at 4:27 PM Christopher Schultz <ch...@christopherschultz.net> wrote:
> On 2/11/20 2:37 AM, Martin Grigorov wrote:
> > I guess you use Java 8. Newer versions of Java try UTF-8 first and
> > then fall back to ISO-8859-1:
> > https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/PropertyResourceBundle.html
>
> Correct, I am using Java 8:
>
> $ java -version
> openjdk version "1.8.0_232"
> OpenJDK Runtime Environment (build 1.8.0_232-8u232-b09-1~deb9u1-b09)
> OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)
>
> This is the version that Debian 9 provides. I could install a higher
> patch-version but would it help?
>
> On 2/11/20 6:38 AM, Mark Thomas wrote:
> > On 10/02/2020 20:58, Christopher Schultz wrote:
> >> All,
> >>
> >> I've recently begun making a change to my application's resource
> >> bundles, converting them into UTF-8 for readability and
> >> converting them to ISO-8859-1 during my build process to make
> >> ResourceBundle happy.
> >>
> >> I have everything working, except that Eclipse still thinks that
> >> my files ought to be ISO-8859-1 and ruins them when I load them.
> >> Sometimes, it's very obvious and that's not a problem: a
> >> developer will see that and fix it before continuing. But some
> >> files are only *slightly* broken by this and someone might make a
> >> mistake.
> >
> > I don't think we have seen this with Tomcat. Or have we (since we
> > switched to UTF-8)?
> >
> > The thing that bugged me was having to manually switch properties
> > files to UTF-8 to view them "properly". Your mail motivated me to
> > track down where I can change that in Eclipse:
> >
> > Window->Preferences->General->Content Types
> >
> > and I have changed Java properties files to use UTF-8. So that is
> > my personal niggle fixed. Thanks for the motivation.
>
> Yes, this *will* fix things, but:
>
> 1. It's a global setting, so it can't be set on a per-project basis.
> That means you have to be willing to convert ALL your properties files
> across ALL your projects to UTF-8. That may be okay for some people,
> but not all.
>
> 2. This is a guess: Tomcat's ide-eclipse ant target can't set that
> setting for the Tomcat project(s) because it's a global setting.
> Therefore, anyone using Eclipse as an IDE will have to manually set
> their content-type in order to NOT damage any of the files we ship.
>
> >> NOTE: We don't keep Eclipse settings in revision-control, so I
> >> can't modify everyone's Eclipse configuration. We are using svn
> >> and svn:mime-type is correctly set for these files; Eclipse just
> >> ignores that.
> >
> > I've seen that too. While I found it rather annoying, it wasn't
> > annoying enough to try and find a fix as that looked like it would
> > require patching Eclipse and/or the svn plug-in.
> >
> >> Anyway, I found that adding a UTF-8 BOM to the beginning of the
> >> file fixes that issue and Eclipse does the right thing.
> >
> > Ah. So Eclipse *is* doing content scanning. Interesting.
>
> Well, it's not really *content* scanning. But a BOM is the official
> way to tell the difference between a UTF-8 encoded file and one that
> just happens to have a whole bunch of valid UTF-8 byte sequences
> through (most of) the file.
>
> >> As a sanity check, I looked at how Tomcat's files are laid-out
> >> and I don't see any BOMs.
> >
> > Correct. The only files in the code base that should have BOMs at
> > the moment are the ones in the test web application (under
> > bug49nnn) for testing the default Servlet's handling of files with
> > BOMs.
> >
> >> Should we add BOMs? Is there any reason NOT to use a BOM? These
> >> are file types that are officially supposed to be ISO-8859-1 but
> >> everyone wants to handle them differently, so I think adding BOMs
> >> might be a good idea so that editors are always informed of
> >> exactly what's happening.
> >>
> >> WDYT?
> >
> > I was concerned that adding a BOM would cause problems when
> > reading property files. I've seen reports of that with Java in the
> > past. A quick test suggests that the issue is no longer present
> > with the latest Java 8.
>
> I actually had another problem after I implemented all of this: any
> property file without a blank and/or comment line at the top ended up
> with a mangled and unusable *first* property key. A file like this:
>
> first.property=foo
> second.property=bar
>
> would end up like this after a trip through "native2ascii -encoding
> UTF-8":
>
> \ufefffirst.property=foo
> second.property=bar
>
> native2ascii stupidly interprets the UTF-8 BOM as an actual character,
> and encodes it in the output.
>
> This appears to be a bug in (at least old versions of) Java and/or
> native2ascii. I've got local installations of Java 8, 11 (Adopt), 11
> (Oracle), and 13 (OpenJDK), and only Java 8 has a "native2ascii"
> binary present. I see ant's <native2ascii> task has its own
> implementation, but it's probably very simple, just like the
> native2ascii program itself. Java's Reader classes incorrectly
> interpret the BOM as an actual character instead of an ignorable UTF-8
> control sequence.
>
> I can confirm that Java 13 still seems to have this problem: running
> ant's <native2ascii> under Java 13 still corrupts the first line of
> the file.
>
> Ensuring that the first line of the file is a comment or a blank line
> fixes things:
>
> # BOM
> first.property=foo
> second.property=bar
>
> becomes:
>
> \ufeff# BOM
> first.property=foo
> second.property=bar
>
> > With the use of POEditor and the import/export scripts we have, it
> > would be unusual for someone to be editing any of the property
> > files where UTF-8 vs ISO-8859-1 matters. Thinking about it a little
> > more, there would be a need to do this to edit non-English strings
> > in the older branches where the key doesn't exist in the latest
> > code. That strikes me as a fairly rare use case.
> >
> > My other worry is that some editors will fail to handle the BOM
> > correctly and we'll end up causing more issues than we solve. I've
> > little basis for that worry other than (possibly out of date)
> > experience.
> >
> > Overall, I guess I am -0 on adding BOMs.
>
> Okay. This is a fairly recent change to Tomcat, and frankly, we (a)
> don't get a huge number of outside contributions which include changes
> to the localized properties files (except for the translation-only
> contributions, which have been great!) and (b) often ignore the
> non-English translations in the first place because we are lazy.
>
> I think maybe this can stay on the back-burner until we see if we end
> up with any problems.
>
> Does/can "checkstyle" check for valid UTF-8 byte sequences in
> .properties files? I think that may be a helpful check to add if it's
> not already in there.
>

Just to add: I am a happy user of XML-based properties files (since Java
1.5). It is relatively simple to roll out an XmlPropertyResourceBundle, e.g.
https://gist.github.com/asicfr/1b76ea60029264d7be15d019a866e1a4

This should solve your issues.
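
For illustration, here is a minimal sketch of the XML-bundle idea. It is not a copy of the gist; it follows the pattern shown in the ResourceBundle.Control Javadoc, and the names XmlBundles, XmlResourceBundle and XmlControl are made up for this example. It assumes the bundles are stored in the XML format understood by Properties.loadFromXML, and it ignores the reload flag for brevity. Since the XML file declares its own encoding, the ISO-8859-1 vs UTF-8 question should not come up for these files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.Locale;
import java.util.Properties;
import java.util.ResourceBundle;

public class XmlBundles {

    /** Hypothetical bundle backed by Properties loaded from the XML properties format. */
    public static final class XmlResourceBundle extends ResourceBundle {
        private final Properties props = new Properties();

        public XmlResourceBundle(InputStream in) throws IOException {
            // loadFromXML honours the encoding declared in the XML prolog.
            props.loadFromXML(in);
        }

        @Override
        protected Object handleGetObject(String key) {
            return props.getProperty(key);
        }

        @Override
        public Enumeration<String> getKeys() {
            // Simplification: only this bundle's own keys, not the parent's.
            return Collections.enumeration(props.stringPropertyNames());
        }
    }

    /** Hypothetical Control that looks for baseName_locale.xml on the class path. */
    public static final class XmlControl extends ResourceBundle.Control {
        @Override
        public List<String> getFormats(String baseName) {
            return Collections.singletonList("xml");
        }

        @Override
        public ResourceBundle newBundle(String baseName, Locale locale, String format,
                                        ClassLoader loader, boolean reload) throws IOException {
            String bundleName = toBundleName(baseName, locale);
            String resourceName = toResourceName(bundleName, format);
            // The reload flag is ignored in this sketch.
            try (InputStream in = loader.getResourceAsStream(resourceName)) {
                return in == null ? null : new XmlResourceBundle(in);
            }
        }
    }
}
```

Usage would look something like ResourceBundle.getBundle("Messages", Locale.GERMAN, new XmlBundles.XmlControl()), where "Messages" is a placeholder base name. Note that ResourceBundle.Control is only supported for code in the unnamed module on Java 9 and later.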
>
> -chris
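
On the mangled first key described above: Java's UTF-8 decoder hands a leading BOM through as the ordinary character U+FEFF, so any native2ascii-style step that escapes non-ASCII characters will escape it too. A build step that wants to tolerate BOMs can drop one leading U+FEFF before escaping. The following is only a sketch of that idea; BomSafeEscaper and toAsciiEscapes are hypothetical names, not part of the JDK or Ant, and the escaping is deliberately simplistic (it escapes UTF-16 code units and does nothing else).

```java
import java.nio.charset.StandardCharsets;

public class BomSafeEscaper {
    private static final char BOM = '\uFEFF';

    /** Escapes every non-ASCII UTF-16 code unit in backslash-u form, dropping one leading BOM. */
    public static String toAsciiEscapes(String text) {
        // Skip a single BOM at the very start of the text, if present.
        int start = (!text.isEmpty() && text.charAt(0) == BOM) ? 1 : 0;
        StringBuilder out = new StringBuilder(text.length());
        for (int i = start; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "\uFEFFfirst.property=caf\u00e9\n".getBytes(StandardCharsets.UTF_8);
        // Decoding keeps the BOM as the first character of the string.
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        System.out.print(toAsciiEscapes(decoded));
    }
}
```

With this, the first key comes out as first.property rather than with the BOM escape glued to its front, so the comment-line workaround would no longer be needed for files that pass through such a step.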