[ https://issues.apache.org/jira/browse/SPARK-34195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anthony Smart updated SPARK-34195:
----------------------------------
    Description: 
The base value for parsing a two-digit year date string is fixed at 2000 in Spark, so parsing "10-JAN-97" yields 2097 instead of 1997.

I'm unclear why this base value was changed in Spark, given that the standard Python datetime module uses the more sensible value of 69 as the boundary cut-off for deciding which century to apply.

Reference: [https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html]
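
A minimal reproduction of the behaviour described above (a sketch only; it assumes an active SparkSession bound to {{spark}} and Spark 3.0.x defaults):

{code:python}
import pyspark.sql.functions as F

# "97" is resolved against the fixed base of 2000, so the parse lands in 2097.
df = spark.createDataFrame([('10-JAN-97',)], ['dt'])
df.select(F.to_date('dt', 'dd-MMM-yy').alias('parsed')).show()
# parsed: 2097-01-10
{code}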

Other libraries, e.g. .NET Core, use 29 as the boundary cut-off, whereas a fixed base value of 2000 is of little practical use: most dates encountered in real-world data span both centuries. I therefore propose that this behaviour be reverted to match the Python datetime module, and/or that the base value be exposed as an option on the various date functions. That would give consistent behaviour between Python and PySpark.
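
For completeness, Spark 3 can already be told to fall back to the pre-3.0 parser, whose two-digit years follow java.text.SimpleDateFormat's sliding window (roughly 80 years back / 20 years forward from the current date) rather than a fixed base of 2000. This is only a partial mitigation, since the pivot still cannot be chosen explicitly (sketch, same SparkSession assumption as above):

{code:python}
import pyspark.sql.functions as F

# Re-enable the Spark 2.x (SimpleDateFormat-based) parsing behaviour.
# The two-digit-year window then slides with the current date, so the exact
# pivot is not configurable, but "97" parses as 1997 at the time of writing.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

spark.createDataFrame([('10-JAN-97',)], ['dt']) \
     .select(F.to_date('dt', 'dd-MMM-yy').alias('parsed')) \
     .show()   # parsed: 1997-01-10
{code}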

 

Python:

{code:python}
import datetime

datetime.datetime.strptime('10-JAN-69', '%d-%b-%y').date()
# Out[118]: datetime.date(1969, 1, 10)
{code}

PySpark:

{code:python}
import pyspark.sql.functions as F

df = spark.createDataFrame([('10-JAN-69',)], ['dt'])
df.withColumn("newdate", F.from_unixtime(F.unix_timestamp("dt", "dd-MMM-yy"), "dd-MM-yyyy")).collect()
# Out[117]: [Row(dt='10-JAN-69', newdate='10-01-2069')]
{code}
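
For contrast, the same format string on either side of Python's pivot (a quick check with the standard library; the 68/69 split below is the documented %y behaviour):

{code:python}
import datetime

# Per the Python docs, %y maps 69-99 to 1969-1999 and 00-68 to 2000-2068.
datetime.datetime.strptime('10-JAN-69', '%d-%b-%y').date()   # datetime.date(1969, 1, 10)
datetime.datetime.strptime('10-JAN-68', '%d-%b-%y').date()   # datetime.date(2068, 1, 10)
{code}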
  
  
  
As a workaround I had to write my own solution. The code below is specific to my data pipeline, but it shows what was needed just to change the boundary cut-off so that two-digit years are handled sensibly.
  
{code:python}
import pyspark.sql.functions as F
from pyspark.sql.functions import to_date, col, trim


def convert_dtypes(entity, schema, boundary="40"):
    # std_df is the input DataFrame produced earlier in the pipeline.
    cols = []
    for x in schema[entity]:
        for c in std_df.columns:
            if x['name'] == c:
                if x['dtype'] == 'date':
                    # Split the 'dd-MMM-yy' string into its parts.
                    dd = F.substring(c, 1, 2)
                    MMM = F.substring(c, 4, 3)
                    yy = F.substring(c, 8, 2)
                    # Rebuild as 'dd-MMM-yyyy', picking the century from the
                    # two-digit year: >= boundary -> 19xx, otherwise 20xx.
                    n = (
                        F.when(trim(col(c)) == "", None).otherwise(
                            F.when(yy >= boundary,
                                   F.concat(dd, F.lit("-"), MMM, F.lit("-"), F.lit("19"), yy))
                             .otherwise(
                                   F.concat(dd, F.lit("-"), MMM, F.lit("-"), F.lit("20"), yy)))
                    )
                    cols.append(to_date(n, 'dd-MMM-yyyy').alias(c))
                else:
                    cols.append(col(c).cast(x['dtype']))
                    # cols[-1].nullable = x['nullable']
    return std_df.select(*cols)
{code}
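
For a single column the same idea boils down to the sketch below (the function name and the "40" pivot are placeholders, and it assumes the same fixed 'dd-MMM-yy' layout as above):

{code:python}
import pyspark.sql.functions as F

def parse_two_digit_year(col_name, boundary='40'):
    # Rebuild 'dd-MMM-yy' as 'dd-MMM-yyyy', picking the century from the
    # two-digit year: >= boundary -> 19xx, otherwise 20xx.
    yy = F.substring(col_name, 8, 2)
    century = F.when(yy >= boundary, F.lit('19')).otherwise(F.lit('20'))
    rebuilt = F.concat(F.substring(col_name, 1, 7), century, yy)
    return F.to_date(rebuilt, 'dd-MMM-yyyy')

# Usage (hypothetical column name):
# df = spark.createDataFrame([('10-JAN-69',), ('10-JAN-97',), ('05-FEB-21',)], ['dt'])
# df.select('dt', parse_two_digit_year('dt').alias('parsed')).show()
# 10-JAN-69 -> 1969-01-10, 10-JAN-97 -> 1997-01-10, 05-FEB-21 -> 2021-02-05
{code}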
  

> Base value for parsing two-digit year should be made configurable
> -----------------------------------------------------------------
>
>                 Key: SPARK-34195
>                 URL: https://issues.apache.org/jira/browse/SPARK-34195
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.0.1
>            Reporter: Anthony Smart
>            Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>