[ https://issues.apache.org/jira/browse/SPARK-36934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-36934:
---------------------------------
    Description: 
This was tested with the master branch build from 04.10.21.

{code}
import pyspark.pandas as ps

df = ps.DataFrame({'year': ['2015-2-4', '2016-3-5'],
                   'month': [2, 3],
                   'day': [4, 5],
                   'test': [1, 2]})

df["year"] = ps.to_datetime(df["year"]) 

df.info() 

<class 'pyspark.pandas.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   year    2 non-null      datetime64
 1   month   2 non-null      int64
 2   day     2 non-null      int64
 3   test    2 non-null      int64
dtypes: datetime64(1), int64(3)


spark_df_date = df.to_spark() 

spark_df_date.printSchema() 

root
|-- year: timestamp (nullable = true)
|-- month: long (nullable = false)
|-- day: long (nullable = false)
|-- test: long (nullable = false)  

spark_df_date.write.parquet("s3a://falk0509/spark_df_date.parquet")  
{code}
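Spark writes TimestampType columns to Parquet with the legacy INT96 encoding by default (spark.sql.parquet.outputTimestampType defaults to INT96), and readers that do not interpret INT96 as a timestamp can end up showing the raw 12-byte values. A possible workaround, sketched below under that assumption, is to switch the writer to the standard TIMESTAMP_MICROS type before writing:

{code}
# Possible workaround (sketch), assuming the problem is the default INT96
# Parquet timestamp encoding: write the column with the standard
# TIMESTAMP_MICROS logical type instead. Assumes the active SparkSession is
# available as `spark`, as in a pyspark shell.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

spark_df_date.write.mode("overwrite").parquet("s3a://falk0509/spark_df_date.parquet")
{code}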

I then load the files into Apache Drill, running the apache/drill:master-openjdk-14 Docker image, and query them:

SELECT * FROM cp.`/data/spark_df_date.*`  

For the year column it prints:

{code}
\x00\x00\x00\x00\x00\x00\x00\x00\xE2}%\x00

\x00\x00\x00\x00\x00\x00\x00\x00m\x7F%\x00 
{code}
 

The rest of the columns are fine.
So is this a Spark problem or an Apache Drill problem?
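
For what it's worth, the two values Drill prints are 12 bytes each, which matches the INT96 layout (an 8-byte little-endian nanoseconds-of-day followed by a 4-byte little-endian Julian day). A small decoding sketch, written against that assumption, turns the first value back into the expected date:

{code}
import struct
from datetime import datetime, timedelta

# Decode the first 12-byte value shown above, assuming Parquet's INT96
# timestamp layout: 8-byte nanoseconds-of-day followed by a 4-byte Julian
# day number, both little-endian.
raw = b"\x00\x00\x00\x00\x00\x00\x00\x00\xe2\x7d\x25\x00"
nanos_of_day, julian_day = struct.unpack("<qi", raw)

# Julian day 2451545 corresponds to 2000-01-01, which anchors the conversion.
date = datetime(2000, 1, 1) + timedelta(days=julian_day - 2451545)
print(date + timedelta(microseconds=nanos_of_day // 1000))
# 2015-02-04 00:00:00, i.e. the first 'year' value
{code}

If that decodes as expected, the written files themselves are intact and the question is mostly about how the reader handles INT96: Drill has a session option for that (store.parquet.reader.int96_as_timestamp, if I remember the name correctly), and on the write side the TIMESTAMP_MICROS setting sketched above avoids INT96 entirely.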



> Timestamps are written as byte arrays.
> ---------------------------------------
>
>                 Key: SPARK-36934
>                 URL: https://issues.apache.org/jira/browse/SPARK-36934
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Bjørn Jørgensen
>            Priority: Major
>


