[ https://issues.apache.org/jira/browse/ARROW-9502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173209#comment-17173209 ]
Joris Van den Bossche commented on ARROW-9502:
----------------------------------------------
Such a conversion will indeed happen on the write path because Parquet has only
a single DATE type, which is equivalent to {{date32[day]}} (see
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date).
The question is then, when _reading_, whether we should simply always use the
{{date32[day]}} type, or check the Arrow schema that is saved in the
parquet file's metadata to see which date type was used originally.
I think both options have their pros and cons: checking the metadata gives a
fully faithful roundtrip, but it also adds a conversion step when reading into
{{date64[ms]}}, whereas reading as {{date32[day]}} needs no transformation
once deserialized.
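To make the conversion step concrete, here is a minimal sketch in plain Python (standard library only, not pyarrow). It assumes the usual interpretation of the two types: {{date32[day]}} as a count of days since 1970-01-01 and {{date64[ms]}} as a count of milliseconds since 1970-01-01. The helper names are hypothetical and only illustrate the arithmetic, not how Arrow implements it.

```python
import datetime

# Assumption: both types count from the UNIX epoch.
EPOCH = datetime.date(1970, 1, 1)
MS_PER_DAY = 86_400_000  # 24 * 60 * 60 * 1000


def date_to_days(d):
    """Encode a date the way a days-since-epoch (date32-like) value would."""
    return (d - EPOCH).days


def days_to_ms(days):
    """The extra step needed to expose a stored day count as milliseconds."""
    return days * MS_PER_DAY


def ms_to_date(ms):
    """Decode a milliseconds-since-epoch (date64-like) value back to a date."""
    return EPOCH + datetime.timedelta(milliseconds=ms)


d = datetime.date(2000, 1, 1)
days = date_to_days(d)   # what a Parquet DATE column would hold
ms = days_to_ms(days)    # what a date64-like reader would have to compute
print(days, ms, ms_to_date(ms))
```

The multiplication is cheap per value, but it means the column can no longer be handed back exactly as it was decoded, which is the trade-off described above.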
> [Python][C++] Date64 converted to Date32 on parquet
> ---------------------------------------------------
>
> Key: ARROW-9502
> URL: https://issues.apache.org/jira/browse/ARROW-9502
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Reporter: Jorge
> Priority: Major
>
> Executing the example below,
> {code:python}
> import datetime
> import pyarrow as pa
> import pyarrow.parquet
> data = [
>     datetime.datetime(2000, 1, 1, 12, 34, 56, 123456),
>     datetime.datetime(2000, 1, 1),
> ]
> data32 = pa.array(data, type='date32')
> data64 = pa.array(data, type='date64')
> table = pyarrow.Table.from_arrays([data32, data64], names=['a', 'b'])
> pyarrow.parquet.write_table(table, 'a.parquet')
> print(table)
> print()
> print(pyarrow.parquet.read_table('a.parquet'))
> {code}
> yields
> {code:java}
> pyarrow.Table
> a: date32[day]
> b: date64[ms]
> pyarrow.Table
> a: date32[day]
> b: date32[day] <------- IMO it should be date64[ms]
> {code}
> indicating that pyarrow converted its date64[ms] schema to date32[day]. I
> used the rust crate to print parquet's metadata, and the value is indeed
> stored as i32, which suggests that this likely happens on the writer, not
> reader.
> IMO this does not have any practical implication because they are both dates
> and a 32 bit date (in days) can hold more dates than a 64 bit date in
> milliseconds, but still constitutes an error as the roundtrip serialization
> does not yield the same schema.
> A broader question I have is why date64 exists in the first place. I can't
> see any reason to store a *date* in milliseconds since the epoch.