[jira] [Commented] (ARROW-9502) [Python][C++] Date64 converted to Date32 on parquet

Joris Van den Bossche (Jira) Fri, 07 Aug 2020 08:28:20 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-9502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173209#comment-17173209
 ]


Joris Van den Bossche commented on ARROW-9502:
----------------------------------------------

Such a conversion will indeed happen on the write path because Parquet has only 
a single DATE type, which is equivalent to date32[day] (see 
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date).
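
To make the write-path conversion concrete, here is a minimal sketch of the arithmetic using only the Python standard library. It assumes the usual definitions: date32 counts days since the UNIX epoch and date64 counts milliseconds since the UNIX epoch. The helper names are hypothetical and are not part of pyarrow or of the Parquet writer.

```python
import datetime

# Illustrative constants and helpers; none of these names come from pyarrow.
EPOCH = datetime.date(1970, 1, 1)
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86_400_000


def date_to_date64(d: datetime.date) -> int:
    """Milliseconds since the epoch, as a date64 value would store them."""
    return (d - EPOCH).days * MS_PER_DAY


def date64_to_date32(ms: int) -> int:
    """Days since the epoch, i.e. the value a Parquet DATE column holds."""
    return ms // MS_PER_DAY


def date32_to_date64(days: int) -> int:
    """The inverse step a reader would need to hand back date64 values."""
    return days * MS_PER_DAY


ms = date_to_date64(datetime.date(2000, 1, 1))
days = date64_to_date32(ms)
print(ms, days)
```

For whole-day values the two representations convert back and forth without loss, so only the schema type is lost in the roundtrip, not the data.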

The question then is what to do when _reading_: should we simply always use the 
{{date32[day]}} type, or should we check the Arrow schema that is saved in the 
Parquet file's metadata to see which date type was used originally?

I think both have their pros and cons: checking the metadata can indeed give a 
fully faithful roundtrip, but it also adds a conversion step when reading the 
data back into {{date64[ms]}}, whereas reading it as {{date32[day]}} needs no 
transformation once deserialized.


> [Python][C++] Date64 converted to Date32 on parquet
> ---------------------------------------------------
>
>                 Key: ARROW-9502
>                 URL: https://issues.apache.org/jira/browse/ARROW-9502
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Jorge
>            Priority: Major
>
> Executing the example below, 
> {code:python}
> import datetime
> import pyarrow as pa
> import pyarrow.parquet
> data = [
>     datetime.datetime(2000, 1, 1, 12, 34, 56, 123456), 
>     datetime.datetime(2000, 1, 1)
> ]
> data32 = pa.array(data, type='date32')
> data64 = pa.array(data, type='date64')
> table = pyarrow.Table.from_arrays([data32, data64], names=['a', 'b'])
> pyarrow.parquet.write_table(table, 'a.parquet')
> print(table)
> print()
> print(pyarrow.parquet.read_table('a.parquet'))
> {code}
> yields
> {code:java}
> pyarrow.Table
> a: date32[day]
> b: date64[ms]
> pyarrow.Table
> a: date32[day]
> b: date32[day]   <------- IMO it should be date64[ms]
> {code}
> indicating that pyarrow converted its date64[ms] schema to date32[day]. I 
> used the rust crate to print parquet's metadata, and the value is indeed 
> stored as i32, which suggests that this likely happens on the writer, not 
> reader.
> IMO this does not have any practical implication because they are both dates 
> and a 32 bit date (in days) can hold more dates than a 64 bit date in 
> milliseconds, but still constitutes an error as the roundtrip serialization 
> does not yield the same schema.
> A broader question I have is why date64 exists in the first place? I can't 
> see any reason to store a *date* in milliseconds since the epoch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
