Sean Owen created SPARK-18076:
---------------------------------
Summary: Fix default Locale used in DateFormat, NumberFormat to
Locale.US
Key: SPARK-18076
URL: https://issues.apache.org/jira/browse/SPARK-18076
Project: Spark
Issue Type: Bug
Components: MLlib, Spark Core, SQL
Affects Versions: 2.0.1
Reporter: Sean Owen
Many parts of the code use {{DateFormat}} and {{NumberFormat}} instances.
Although the behavior of these formats is mostly determined by things like
format strings, the exact behavior can vary according to the platform's default
locale. Although the locale defaults to "en", it can be set to something else
through environment variables. When it is, the same code can succeed or fail
based on the locale alone:
{code}
import java.text._
import java.util._

// Parse a date with an abbreviated month name, using the given locale
def parse(s: String, l: Locale) = new SimpleDateFormat("yyyyMMMdd", l).parse(s)

parse("1989Dec31", Locale.US)
// Sun Dec 31 00:00:00 GMT 1989

parse("1989Dec31", Locale.UK)
// Sun Dec 31 00:00:00 GMT 1989

parse("1989Dec31", Locale.CHINA)
// java.text.ParseException: Unparseable date: "1989Dec31"
//   at java.text.DateFormat.parse(DateFormat.java:366)
//   at .parse(<console>:18)
//   ... 32 elided

parse("1989Dec31", Locale.GERMANY)
// java.text.ParseException: Unparseable date: "1989Dec31"
//   at java.text.DateFormat.parse(DateFormat.java:366)
//   at .parse(<console>:18)
//   ... 32 elided
{code}
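The same kind of variation applies to {{NumberFormat}}. The following is a minimal sketch, not code taken from Spark; the helper name {{formatNumber}} is made up for illustration. It formats one value under two locales and gets two different strings, because the grouping and decimal separators come from the locale.

```scala
import java.text.NumberFormat
import java.util.Locale

// Hypothetical helper: format a number with the general-purpose format of a locale
def formatNumber(x: Double, l: Locale): String =
  NumberFormat.getInstance(l).format(x)

val us = formatNumber(1234.5, Locale.US)      // "1,234.5"
val de = formatNumber(1234.5, Locale.GERMANY) // "1.234,5"
```

Any code that later parses such a string under a different locale may misread it or reject it.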
Where not otherwise specified, I believe all instances in the code should
default to some fixed value, and that should probably be {{Locale.US}}. This
matches the JVM's default, and specifies both language ("en") and region ("US")
to remove ambiguity. It also most closely matches the current behavior of the
code (unless the default locale was changed), because the code currently
defaults to "en".
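A sketch of what fixing the locale could look like, assuming the change amounts to passing {{Locale.US}} explicitly wherever a formatter is constructed (the actual PR may differ). It temporarily switches the JVM default to {{Locale.GERMANY}} and compares a formatter built without a locale against one built with {{Locale.US}}.

```scala
import java.text.SimpleDateFormat
import java.util.Locale
import scala.util.Try

val saved = Locale.getDefault
val (withDefault, withFixed) =
  try {
    // Simulate a machine whose default locale is not English
    Locale.setDefault(Locale.GERMANY)
    // No locale argument: picks up the (German) default and rejects "Dec"
    val a = Try(new SimpleDateFormat("yyyyMMMdd").parse("1989Dec31"))
    // Explicit Locale.US: unaffected by the default locale
    val b = Try(new SimpleDateFormat("yyyyMMMdd", Locale.US).parse("1989Dec31"))
    (a, b)
  } finally {
    Locale.setDefault(saved) // always restore the original default
  }
```

Here {{withDefault}} is a failure and {{withFixed}} is a success, although both formatters use the same pattern and input.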
This affects SQL date/time functions. At the moment, the only SQL function that
lets the user specify language/country is "sentences", which is consistent with
Hive.
It affects dates passed in the JSON API.
It affects some strings rendered in the UI, potentially. Although this isn't a
correctness issue, there may be an argument for not letting that vary (?)
It affects a bunch of instances where dates are formatted into strings for
things like IDs or file names, which is far less likely to cause a problem, but
worth making consistent.
The other occurrences are in tests.
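To illustrate the ID and file name case: even a purely numeric pattern can depend on the locale, because the locale can select a different calendar. The sketch below uses a made-up helper, {{timestampId}}, which is not a Spark API. It assumes a JDK on which the {{th-TH}} locale selects the Thai Buddhist calendar, which is the documented behavior of the JDKs I know of.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale, TimeZone}

// Hypothetical helper: render a timestamp as a compact, sortable ID
def timestampId(millis: Long, locale: Locale): String = {
  val fmt = new SimpleDateFormat("yyyyMMddHHmmss", locale)
  fmt.setTimeZone(TimeZone.getTimeZone("UTC")) // fix the zone so only the locale varies
  fmt.format(new Date(millis))
}

val millis = 631065600000L // 1989-12-31T00:00:00Z
val usId   = timestampId(millis, Locale.US)
// Assumption: th-TH uses the Buddhist calendar, so the year digits differ
val thaiId = timestampId(millis, Locale.forLanguageTag("th-TH"))
```

With {{Locale.US}} the ID is {{19891231000000}}. Under the Thai locale the year component is different, so IDs generated on differently configured machines would not sort or match consistently.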
The downside to this change is also its upside: the behavior no longer depends
on the default JVM locale, but it also can't be changed through the default JVM
locale. For example, if you wanted to parse some dates in a way that depended on
a non-US locale (not just the format string), that would no longer be possible.
There's no means of specifying a locale, for example, in SQL functions for
parsing dates. However, controlling this by globally changing the locale isn't
exactly great either.
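For comparison, here is a sketch of the alternative to a global setting: the locale is an explicit argument that defaults to {{Locale.US}}. The helper {{parseDate}} is hypothetical and does not correspond to any existing Spark or SQL function; it only shows the shape such an API could take.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}
import scala.util.Try

// Hypothetical helper: the locale is a parameter, fixed to Locale.US unless the caller overrides it
def parseDate(s: String, pattern: String, locale: Locale = Locale.US): Date =
  new SimpleDateFormat(pattern, locale).parse(s)

val usDate = parseDate("31 December 1989", "dd MMMM yyyy")
// A caller who really needs German month names asks for them explicitly
val deDate = parseDate("31 Dezember 1989", "dd MMMM yyyy", Locale.GERMANY)
// Without the override, the German input is rejected regardless of the JVM default
val rejected = Try(parseDate("31 Dezember 1989", "dd MMMM yyyy"))
```

Both successful calls yield the same {{Date}}, and the outcome never depends on how the machine running the code is configured.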
The purpose of this change is to make the current default behavior
deterministic and fixed. PR coming.
CC [~hyukjin.kwon]