[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812838#comment-16812838 ]

shane knapp commented on SPARK-27389:
-------------------------------------

well, according to [~bryanc]:

"""
>From the stacktrace, it looks like it's getting this from 
>"spark.sql.session.timeZone" which defaults to Java.util 
>TimeZone.getDefault.getID()
"""

here are the versions of tzdata* installed on the workers having this problem:
{noformat}
tzdata-2019a-1.el6.noarch
tzdata-java-2019a-1.el6.noarch
{noformat}

looks like we're on the latest tzdata release, but US/Pacific-New is STILL
showing up in /usr/share/zoneinfo/US.

when i dig into the java tzdata package, i find the following:

{noformat}
$ strings /usr/share/javazi/ZoneInfoMappings
...bunch of cruft deleted...
US/Pacific
America/Los_Angeles
US/Pacific-New
America/Los_Angeles
{noformat}

so, it appears to me that:
1) the OS still sees US/Pacific-New via tzdata
2) java still sees US/Pacific-New via tzdata-java
3) python has no idea WTF US/Pacific-New is and (occasionally) barfs during 
pyspark unit tests
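a quick way to poke at point 3 from python itself -- note the check that actually fails here is pytz's, since pytz ships its own copy of the zone database independent of /usr/share/zoneinfo, which is exactly how the OS can know a zone that python doesn't. a diagnostic sketch using the stdlib equivalent (python 3.9+ `zoneinfo`; results depend on the host's installed tzdata):

```python
# diagnostic sketch: does python's tz database know a given zone name?
# (python 3.9+ stdlib; the answer depends on the host's tzdata files)
from zoneinfo import available_timezones

def python_knows(zone):
    return zone in available_timezones()
```

the pytz-side equivalent would be `zone in pytz.all_timezones_set`, which is the membership check that blows up into UnknownTimeZoneError in the stacktrace below.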

so, should i go ahead and manually hack
lib/python2.7/site-packages/pytz/__init__.py and add 'US/Pacific-New', which
would fix the symptom (w/o fixing the cause)?
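alternatively, a less invasive version of the same band-aid: normalize the obsolete alias before the zone string ever reaches pytz, instead of editing the installed package. (hypothetical helper; the alias target comes from the ZoneInfoMappings dump above.)

```python
# band-aid sketch (hypothetical helper): map the obsolete alias to its
# canonical IANA name before handing the string to pytz. US/Pacific-New
# -> America/Los_Angeles, per the tzdata mapping dumped above.
OBSOLETE_TZ_ALIASES = {"US/Pacific-New": "America/Los_Angeles"}

def normalize_tz(zone_name):
    return OBSOLETE_TZ_ALIASES.get(zone_name, zone_name)
```

i.e. call `pytz.timezone(normalize_tz(tz))` wherever the session timezone string gets handed to pytz.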

other than doing that, i'm actually stumped as to why this literally just 
started failing.

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -----------------------------------------------------------------
>
>                 Key: SPARK-27389
>                 URL: https://issues.apache.org/jira/browse/SPARK-27389
>             Project: Spark
>          Issue Type: Task
>          Components: jenkins, PySpark
>    Affects Versions: 3.0.0
>            Reporter: Imran Rashid
>            Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ======================================================================
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
>     pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
>     return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
>     _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
>     return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
>     lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
>     mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in <lambda>
>     if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
>     raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
