Ah, that is not that unexpected to me. Pandas stores the data as object
dtype (to preserve the mixed types; it is up to the user to force a certain
conversion), so for object dtype data you need to do inference to
determine which Arrow or Parquet type to use (in contrast to the other,
non-mixed columns, which in pandas use a dedicated datetime64 dtype).
Inference is always somewhat subjective or incomplete, and here pyarrow
and fastparquet make different decisions (where pyarrow is arguably
making the wrong one).
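
To illustrate what "forcing a certain conversion" on the user side could
look like, here is a minimal sketch using only pandas. The variable names
are made up for illustration, and it assumes the column holds Timestamp
objects with different UTC offsets, like pd_mixed in the example:

```python
import pandas as pd

# Two timestamps with different fixed UTC offsets, as in the pd_mixed column.
mixed = pd.Series(
    [
        pd.Timestamp("1970-01-01 01:00:00+01:00"),
        pd.Timestamp("1970-01-01 02:00:00+02:00"),
    ],
    name="pd_mixed",
)

# pandas cannot represent mixed offsets in a single datetime64 dtype,
# so the values are kept as Python objects.
print(mixed.dtype)  # object

# Explicitly normalize to UTC; the result is a single tz-aware
# datetime column, so no type inference is needed downstream.
utc = pd.to_datetime(mixed, utc=True)
print(utc.dtype)
```

After such an explicit normalization, the writer no longer has to guess
which semantics were intended for the column.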

Joris

On Thu, 11 Jul 2019 at 10:04, Zoltan Ivanfi <[email protected]> wrote:

> Hi Joris,
>
> Out of curiosity I tried it with fastparquet as well and that couldn't
> even save that column:
>
> ValueError: Can't infer object conversion type: 0    1970-01-01 01:00:00+01:00
> 1    1970-01-01 02:00:00+02:00
> Name: pd_mixed, dtype: object
>
> Br,
>
> Zoltan
>
> On Thu, Jul 11, 2019 at 3:55 PM Joris Van den Bossche
> <[email protected]> wrote:
> >
> > Created an issue for the mixed timezones here:
> > https://issues.apache.org/jira/browse/ARROW-5912
> >
> > On Thu, 11 Jul 2019 at 09:42, Joris Van den Bossche <
> > [email protected]> wrote:
> >
> > > Clarification regarding the mixed types (this is in the end not really
> > > related to parquet, but to how pandas gets converted to pyarrow)
> > >
> > > On Thu, 11 Jul 2019 at 09:17, Zoltan Ivanfi <[email protected]> wrote:
> > >
> > >> ...
> > >> This matched my expectations up until pd_mixed. I was surprised to see
> > >> that timestamps with mixed time zones were stored using local
> > >> semantics instead of being normalized to UTC,
> > >>
> > >
> > > For the actual parquet writing semantics, it is more relevant to look at
> > > the arrow Table that gets created from this DataFrame:
> > >
> > > In [20]: pa.Table.from_pandas(df)
> > > Out[20]:
> > > pyarrow.Table
> > > datetime: timestamp[ns]
> > > pd_no_tz: timestamp[ns]
> > > pd_paris: timestamp[ns, tz=Europe/Paris]
> > > pd_helsinki: timestamp[ns, tz=Europe/Helsinki]
> > > pd_mixed: timestamp[us]
> > >
> > > For all columns except pd_mixed the result is clear and expected, but
> > > apparently pyarrow converts the mixed timestamps to a TimestampArray
> > > without timezone using the "local times", and not the UTC-normalized times.
> > >
> > > Now, that certainly feels a bit buggy to me (or at least unexpected). But
> > > this is an issue for the python -> arrow conversion, not related to the
> > > actual parquet writing. I will open a separate JIRA for this.
> > >
> > > Joris
> > >
>
