[ 
https://issues.apache.org/jira/browse/SPARK-54697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naresh P R updated SPARK-54697:
-------------------------------
    Description: 
For example:
{code:sql}
create external table test_calendar (writerType string, inputDate date) stored as parquet;
INSERT INTO test.test_calendar values
  ('spark-corrected', CAST('0685-04-12' AS DATE)),
  ('spark-corrected', CAST('1582-10-04' AS DATE));
{code}
Hive writes a flag in the Parquet file metadata ({*}writer.date.proleptic{*}) that 
tells Hive Parquet readers whether the dates in the file use the hybrid 
(Julian/Gregorian) calendar or the proleptic Gregorian calendar. The writer sets 
this flag from *hive.parquet.date.proleptic.gregorian*, storing 
*writer.date.proleptic* = true/false in the Parquet file metadata.
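
To make the hybrid vs. proleptic distinction concrete, here is a minimal Java sketch. It is not taken from Hive or Spark, and the class and method names are hypothetical. It reads the same year-month-day label in both calendars using only the JDK: java.time.LocalDate is proleptic Gregorian, while java.util.GregorianCalendar with its default cutover is hybrid Julian/Gregorian. The gap between the two readings is how far a date shifts when a file written under one calendar is read under the other.

```java
import java.time.LocalDate;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Hypothetical helper for illustration only; not part of Hive or Spark.
final class CalendarGap {
    private static final long MILLIS_PER_DAY = 86_400_000L;

    // Days since 1970-01-01 when the label is read in the proleptic Gregorian calendar.
    static long prolepticEpochDay(int year, int month, int day) {
        return LocalDate.of(year, month, day).toEpochDay();
    }

    // Days since 1970-01-01 when the same label is read in the hybrid calendar
    // (Julian before the default 1582-10-15 cutover, Gregorian from then on).
    static long hybridEpochDay(int year, int month, int day) {
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();                    // reset all fields, so the time is midnight UTC
        cal.set(year, month - 1, day);  // Calendar months are 0-based
        return Math.floorDiv(cal.getTimeInMillis(), MILLIS_PER_DAY);
    }

    // Number of days by which the hybrid reading is later than the proleptic one.
    static long gapInDays(int year, int month, int day) {
        return hybridEpochDay(year, month, day) - prolepticEpochDay(year, month, day);
    }

    public static void main(String[] args) {
        System.out.printf("0685-04-12 gap: %d day(s)%n", gapInDays(685, 4, 12));
        System.out.printf("1582-10-04 gap: %d day(s)%n", gapInDays(1582, 10, 4));
        System.out.printf("1582-10-15 gap: %d day(s)%n", gapInDays(1582, 10, 15));
    }
}
```

For the two dates in the example the readings differ by 3 and 10 days respectively, and from 1582-10-15 onward they agree, which is why only older dates are affected.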

 

Setting *hive.parquet.date.proleptic.gregorian=true/false* while reading the 
files has no effect: the Hive Parquet reader relies only on the 
*writer.date.proleptic* metadata stored in each individual file.

 

It would be better if Spark honored Hive's *writer.date.proleptic* metadata 
flag, i.e., the Spark writer should add writer.date.proleptic=true/false to the 
Parquet file metadata, and the Spark reader should consult that flag instead of 
relying on spark.sql.parquet.datetimeRebaseModeInRead / 
spark.sql.parquet.datetimeRebaseModeInWrite being set to LEGACY/CORRECTED. 
Alternatively, agree on some other common ground so that every reader knows 
whether the dates are Hybrid or Gregorian.
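
The reader-side half of this proposal could look roughly like the sketch below. It is plain Java with hypothetical names (ProlepticFlag, calendarOf, DateCalendar), not Spark or Hive code. The file's key-value metadata is modeled as a Map<String, String>, and the behavior when the flag is absent is left to the caller as an assumption, since this issue does not specify it.

```java
import java.util.Map;

// Hypothetical sketch of a per-file calendar decision; not an existing Spark or Hive API.
final class ProlepticFlag {
    // Metadata key named in this issue.
    static final String KEY = "writer.date.proleptic";

    enum DateCalendar { PROLEPTIC_GREGORIAN, HYBRID }

    // Decide how to interpret DATE values of one file from its key-value metadata.
    // 'fallback' is used when the writer did not record the flag (assumption:
    // the caller derives it from session configuration).
    static DateCalendar calendarOf(Map<String, String> fileMetadata, DateCalendar fallback) {
        String flag = fileMetadata.get(KEY);
        if (flag == null) {
            return fallback;
        }
        return Boolean.parseBoolean(flag.trim())
                ? DateCalendar.PROLEPTIC_GREGORIAN
                : DateCalendar.HYBRID;
    }

    public static void main(String[] args) {
        Map<String, String> flaggedFile = Map.of(KEY, "false");
        Map<String, String> unflaggedFile = Map.of();
        System.out.println(calendarOf(flaggedFile, DateCalendar.PROLEPTIC_GREGORIAN));
        System.out.println(calendarOf(unflaggedFile, DateCalendar.PROLEPTIC_GREGORIAN));
    }
}
```

Making the decision per file rather than per session mirrors how the Hive reader is described above: the flag stored in each file wins, and configuration only matters when the flag is missing.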

 

Without this common ground, Hive-written files show wrong results in Spark, 
and Spark-written files show wrong results in Hive.

  was:
For example:
{code:sql}
create external table test_calendar (writerType string, inputDate date) stored as parquet;
INSERT INTO test.test_calendar values
  ('spark-corrected', CAST('0685-04-12' AS DATE)),
  ('spark-corrected', CAST('1582-10-04' AS DATE));
{code}
Hive writes a flag in the Parquet file metadata ({*}writer.date.proleptic{*}) that 
tells Hive Parquet readers whether the dates in the file use the hybrid 
(Julian/Gregorian) calendar or the proleptic Gregorian calendar. The writer sets 
this flag from *hive.parquet.date.proleptic.gregorian*, storing 
*writer.date.proleptic* = true/false in the Parquet file metadata.

Setting *hive.parquet.date.proleptic.gregorian=true/false* while reading the 
files has no effect: the Hive Parquet reader relies only on the 
*writer.date.proleptic* metadata stored in each individual file.

It would be better if Spark honored Hive's *writer.date.proleptic* metadata 
flag, i.e., the Spark writer should add writer.date.proleptic=true/false to the 
Parquet file metadata, and the Spark reader should consult that flag instead of 
relying on spark.sql.parquet.datetimeRebaseModeInRead / 
spark.sql.parquet.datetimeRebaseModeInWrite being set to LEGACY/CORRECTED. 
Alternatively, agree on some other common ground so that every reader knows 
whether the dates are Julian or Gregorian.

Without this common ground, Hive-written files show wrong results in Spark, 
and Spark-written files show wrong results in Hive.


> Read/Write proleptic dates older than 1582-10-04 via Hive/Spark for 
> interoperability
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-54697
>                 URL: https://issues.apache.org/jira/browse/SPARK-54697
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.5.7
>            Reporter: Naresh P R
>            Priority: Major


