[
https://issues.apache.org/jira/browse/SPARK-18076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15602233#comment-15602233
]
Apache Spark commented on SPARK-18076:
--------------------------------------
User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15610
> Fix default Locale used in DateFormat, NumberFormat to Locale.US
> ----------------------------------------------------------------
>
> Key: SPARK-18076
> URL: https://issues.apache.org/jira/browse/SPARK-18076
> Project: Spark
> Issue Type: Bug
> Components: MLlib, Spark Core, SQL
> Affects Versions: 2.0.1
> Reporter: Sean Owen
>
> Many parts of the code use {{DateFormat}} and {{NumberFormat}} instances.
> Although the behavior of these formats is mostly determined by things like
> format strings, the exact behavior can vary according to the platform's
> default locale. Although the locale defaults to "en", it can be set to
> something else by env variables. If it is, the same code can succeed or
> fail based just on locale:
> {code}
> import java.text._
> import java.util._
> def parse(s: String, l: Locale) = new SimpleDateFormat("yyyyMMMdd", l).parse(s)
> parse("1989Dec31", Locale.US)
> Sun Dec 31 00:00:00 GMT 1989
> parse("1989Dec31", Locale.UK)
> Sun Dec 31 00:00:00 GMT 1989
> parse("1989Dec31", Locale.CHINA)
> java.text.ParseException: Unparseable date: "1989Dec31"
> at java.text.DateFormat.parse(DateFormat.java:366)
> at .parse(<console>:18)
> ... 32 elided
> parse("1989Dec31", Locale.GERMANY)
> java.text.ParseException: Unparseable date: "1989Dec31"
> at java.text.DateFormat.parse(DateFormat.java:366)
> at .parse(<console>:18)
> ... 32 elided
> {code}
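The failures quoted above come from the formatter using whichever locale it is handed. Below is a minimal sketch of the fix the description argues for: pass {{Locale.US}} explicitly so that the JVM default no longer matters. The names LocaleDemo, parseUs and withDefaultLocale are made up for illustration and are not Spark code. The GMT time zone is pinned only so the result is reproducible.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale, TimeZone}

object LocaleDemo {
  // Hypothetical helper: the locale is fixed when the formatter is built,
  // so the JVM default locale has no influence on parsing.
  def parseUs(s: String): Date = {
    val fmt = new SimpleDateFormat("yyyyMMMdd", Locale.US)
    fmt.setTimeZone(TimeZone.getTimeZone("GMT")) // assumption: pin the zone too
    fmt.parse(s)
  }

  // Hypothetical helper: run `body` under a temporary JVM default locale,
  // restoring the previous default afterwards.
  def withDefaultLocale[T](l: Locale)(body: => T): T = {
    val saved = Locale.getDefault
    Locale.setDefault(l)
    try body finally Locale.setDefault(saved)
  }
}
```

With this, LocaleDemo.withDefaultLocale(Locale.GERMANY) { LocaleDemo.parseUs("1989Dec31") } should return the same date as it does under an English default.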
> Where not otherwise specified, I believe all instances in the code should
> default to some fixed value, and that should probably be {{Locale.US}}. This
> matches the JVM's default, and specifies both language ("en") and region
> ("US") to remove ambiguity. This most closely matches what the current code
> behavior would be (unless default locale was changed), because it will
> currently default to "en".
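The same reasoning applies to {{NumberFormat}}, which the description names alongside {{DateFormat}}. This is a small sketch of how the output changes with the locale and what fixing it to {{Locale.US}} looks like. NumberLocaleDemo, formatUs and formatIn are hypothetical names, not taken from the Spark code base.

```scala
import java.text.NumberFormat
import java.util.Locale

object NumberLocaleDemo {
  // Locale passed explicitly: the output is the same on every machine.
  def formatUs(x: Double): String =
    NumberFormat.getNumberInstance(Locale.US).format(x)

  // Same call with a caller-chosen locale, shown only for contrast.
  def formatIn(x: Double, l: Locale): String =
    NumberFormat.getNumberInstance(l).format(x)
}
```

NumberLocaleDemo.formatUs(1234.5) gives "1,234.5", while NumberLocaleDemo.formatIn(1234.5, Locale.GERMANY) gives "1.234,5". That kind of difference matters when the string ends up in an ID or a file name.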
> This affects SQL date/time functions. At the moment, the only SQL function
> that lets the user specify language/country is "sentences", which is
> consistent with Hive.
> It affects dates passed in the JSON API.
> It affects some strings rendered in the UI, potentially. Although this isn't
> a correctness issue, there may be an argument for not letting that vary (?)
> It affects a bunch of instances where dates are formatted into strings for
> things like IDs or file names, which is far less likely to cause a problem,
> but worth making consistent.
> The other occurrences are in tests.
> The downside to this change is also its upside: the behavior doesn't depend
> on the default JVM locale, but it also can't be affected by the default JVM
> locale. For example, if you wanted to parse some dates in a way that depended
> on a non-US locale (not just the format string) then it would no longer be
> possible. There's no means of specifying this, for example, in SQL functions
> for parsing dates. However, controlling this by globally changing the locale
> isn't exactly great either.
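To illustrate the trade-off described above: this is not part of the proposed change, only a sketch of what a per-call locale could look like as an alternative to changing the JVM default globally. ExplicitLocaleDemo and parseDate are hypothetical names, and the GMT time zone is an assumption made for reproducibility.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale, TimeZone}

object ExplicitLocaleDemo {
  // Hypothetical signature: Locale.US unless the caller asks for another
  // locale. The JVM default locale is never consulted.
  def parseDate(s: String, pattern: String, locale: Locale = Locale.US): Date = {
    val fmt = new SimpleDateFormat(pattern, locale)
    fmt.setTimeZone(TimeZone.getTimeZone("GMT")) // assumption: fixed zone
    fmt.parse(s)
  }
}
```

Here parseDate("31 December 1989", "dd MMMM yyyy") and parseDate("31 Dezember 1989", "dd MMMM yyyy", Locale.GERMANY) yield the same date, and neither depends on the environment.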
> The purpose of this change is to make the current default behavior
> deterministic and fixed. PR coming.
> CC [~hyukjin.kwon]