Ah, that is not so unexpected to me. Pandas stores this column as object dtype in order to preserve the mixed time zones; it is up to the user to force a particular conversion. For object dtype data, a library has to infer which Arrow or Parquet type to use, in contrast to the other, non-mixed columns, which in pandas have a dedicated datetime64 dtype. Inference is always somewhat subjective or incomplete, and here pyarrow and fastparquet make different decisions: fastparquet refuses to infer a type, while pyarrow picks one (and arguably the wrong one).
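To make the two possible interpretations concrete, here is a minimal sketch using only the Python standard library. It is not pyarrow's actual conversion code; it only takes the two values from the fastparquet error message quoted below and shows what "local" versus "UTC-normalized" semantics would give for them. The variable names are just for illustration.

```python
from datetime import datetime, timedelta, timezone

# The two values of the pd_mixed column quoted in this thread.
mixed = [
    datetime(1970, 1, 1, 1, 0, tzinfo=timezone(timedelta(hours=1))),
    datetime(1970, 1, 1, 2, 0, tzinfo=timezone(timedelta(hours=2))),
]

# "Local" semantics: drop the offset and keep the wall-clock time.
local_naive = [ts.replace(tzinfo=None) for ts in mixed]

# UTC-normalized semantics: convert to UTC first, then drop the offset.
utc_naive = [ts.astimezone(timezone.utc).replace(tzinfo=None) for ts in mixed]

print(local_naive)
print(utc_naive)
```

Both values are the same instant, so the UTC-normalized list holds two identical timestamps, while the local list holds two different wall-clock times and loses the fact that they were the same moment.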
Joris

On Thu, 11 Jul 2019 at 10:04, Zoltan Ivanfi <[email protected]> wrote:

> Hi Joris,
>
> Out of curiosity I tried it with fastparquet as well and that couldn't
> even save that column:
>
> ValueError: Can't infer object conversion type: 0 1970-01-01 01:00:00+01:00
> 1 1970-01-01 02:00:00+02:00
> Name: pd_mixed, dtype: object
>
> Br,
>
> Zoltan
>
> On Thu, Jul 11, 2019 at 3:55 PM Joris Van den Bossche
> <[email protected]> wrote:
> >
> > Created an issue for the mixed timezones here:
> > https://issues.apache.org/jira/browse/ARROW-5912
> >
> > On Thu, 11 Jul 2019 at 09:42, Joris Van den Bossche <
> > [email protected]> wrote:
> >
> > > Clarification regarding the mixed types (this is in the end not really
> > > related to parquet, but to how pandas gets converted to pyarrow)
> > >
> > > On Thu, 11 Jul 2019 at 09:17, Zoltan Ivanfi <[email protected]> wrote:
> > >
> > >> ...
> > >> This matched my expectations up until pd_mixed. I was surprised to see
> > >> that timestamps with mixed time zones were stored using local
> > >> semantics instead of being normalized to UTC,
> > >>
> > >
> > > For the actual parquet writing semantics, it is more relevant to look at
> > > the arrow Table that gets created from this DataFrame:
> > >
> > > In [20]: pa.Table.from_pandas(df)
> > > Out[20]:
> > > pyarrow.Table
> > > datetime: timestamp[ns]
> > > pd_no_tz: timestamp[ns]
> > > pd_paris: timestamp[ns, tz=Europe/Paris]
> > > pd_helsinki: timestamp[ns, tz=Europe/Helsinki]
> > > pd_mixed: timestamp[us]
> > >
> > > For all columns except for pd_mixed the result is clear and expected, but
> > > apparently pyarrow converts the mixed timestamps to a TimestampArray
> > > without timezone using the "local times", and not the UTC normalized times.
> > >
> > > Now, that certainly feels a bit buggy to me (or at least unexpected). But,
> > > this is an issue for the python -> arrow conversion, not related to the
> > > actual parquet writing. I will open a separate JIRA for this.
> > >
> > > Joris
> > >
