Sean Owen created SPARK-18076:
---------------------------------

             Summary: Fix default Locale used in DateFormat, NumberFormat to Locale.US
                 Key: SPARK-18076
                 URL: https://issues.apache.org/jira/browse/SPARK-18076
             Project: Spark
          Issue Type: Bug
          Components: MLlib, Spark Core, SQL
    Affects Versions: 2.0.1
            Reporter: Sean Owen


Many parts of the code use {{DateFormat}} and {{NumberFormat}} instances.
Although the behavior of these formats is mostly determined by things like
format strings, the exact behavior can vary according to the platform's default
locale. Although the locale usually defaults to "en", it can be set to something
else by environment variables. When it is, the same code can succeed or fail
based on the locale alone:

{code}
import java.text._
import java.util._

def parse(s: String, l: Locale) = new SimpleDateFormat("yyyyMMMdd", l).parse(s)

parse("1989Dec31", Locale.US)
Sun Dec 31 00:00:00 GMT 1989

parse("1989Dec31", Locale.UK)
Sun Dec 31 00:00:00 GMT 1989

parse("1989Dec31", Locale.CHINA)
java.text.ParseException: Unparseable date: "1989Dec31"
  at java.text.DateFormat.parse(DateFormat.java:366)
  at .parse(<console>:18)
  ... 32 elided

parse("1989Dec31", Locale.GERMANY)
java.text.ParseException: Unparseable date: "1989Dec31"
  at java.text.DateFormat.parse(DateFormat.java:366)
  at .parse(<console>:18)
  ... 32 elided
{code}
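
The same locale sensitivity applies to {{NumberFormat}}, which the example above does not cover. A minimal sketch, where the helper name {{format}} is only illustrative and the locale is passed explicitly rather than taken from the platform default:

```scala
import java.text.NumberFormat
import java.util.Locale

// Hypothetical helper: format one value under an explicitly chosen locale.
def format(x: Double, l: Locale): String = NumberFormat.getInstance(l).format(x)

// The grouping and decimal separators swap roles between these two locales.
val us = format(1234.5, Locale.US)      // "1,234.5"
val de = format(1234.5, Locale.GERMANY) // "1.234,5"
```

A string produced under one locale will generally not parse back to the same value under the other, which is the same kind of failure as the date example.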

Where not otherwise specified, I believe all instances in the code should
default to some fixed value, and that should probably be {{Locale.US}}. This
matches the JVM's default, and specifies both language ("en") and region ("US")
to remove ambiguity. It also most closely matches the current behavior (unless
the default locale was changed), because the code currently defaults to "en".
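
As a rough sketch of what pinning the locale could look like (not the actual patch): the object {{FixedLocaleFormats}} and its methods are hypothetical names, and {{Locale.setDefault}} is used here only to simulate a platform whose default locale is not English.

```scala
import java.text.{NumberFormat, SimpleDateFormat}
import java.util.Locale

// Hypothetical helpers: every format is created with an explicit Locale.US,
// so the platform default locale never influences parsing or formatting.
object FixedLocaleFormats {
  def dateFormat(pattern: String): SimpleDateFormat =
    new SimpleDateFormat(pattern, Locale.US)

  def numberFormat(): NumberFormat =
    NumberFormat.getInstance(Locale.US)
}

// Simulate a non-English platform, then restore the original default.
val saved = Locale.getDefault
Locale.setDefault(Locale.GERMANY)
val roundTrip =
  try {
    val fmt = FixedLocaleFormats.dateFormat("yyyyMMMdd")
    // Parses despite the German default, because the format is pinned to US.
    fmt.format(fmt.parse("1989Dec31"))
  } finally {
    Locale.setDefault(saved)
  }
```

With the locale fixed at construction time, the parse that failed under {{Locale.GERMANY}} in the earlier example succeeds regardless of the platform default.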

This affects SQL date/time functions. At the moment, the only SQL function that 
lets the user specify language/country is "sentences", which is consistent with 
Hive.

It affects dates passed in the JSON API. 

It affects some strings rendered in the UI, potentially. Although this isn't a 
correctness issue, there may be an argument for not letting that vary (?)

It affects a bunch of instances where dates are formatted into strings for 
things like IDs or file names, which is far less likely to cause a problem, but 
worth making consistent.

The other occurrences are in tests.


The downside to this change is also its upside: the behavior doesn't depend on
the default JVM locale, but it also can't be adjusted through the default JVM
locale. For example, if you wanted to parse some dates in a way that depended
on a non-US locale (not just the format string), that would no longer be
possible. There's no means of specifying a locale, for example, in SQL
functions for parsing dates. However, controlling this by globally changing
the locale isn't exactly great either.

The purpose of this change is to make the current default behavior 
deterministic and fixed. PR coming.

CC [~hyukjin.kwon]


