Good question. I'll take a stab at answering some of it. C++ has the same passthrough / interoperability concerns. Python is significant because its built-in datetime module distinguishes between "local" and "instant" datetimes (which it calls naive and non-naive). In addition, pandas has a very similar representation (e.g. a timestamp column with a single time zone string). PyArrow currently supports interoperability with both. So if you get a timestamp array from pandas with a time zone string, pyarrow will convert it to a timestamp column with a time zone string, and vice versa. Wes & Joris could probably give you a better answer on how pandas actually uses the time zone string. There is also interoperability with Parquet. Parquet does not support an arbitrary time zone string (my guess is Arrow is using metadata for that piece), but it does support a distinction between local/instant logic, and Arrow uses (timezone string == null or empty) to populate that field.
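To make the "local" vs. "instant" distinction concrete, here is a minimal sketch using only Python's standard datetime module (the Python docs use the terms "naive" and "aware"). The helper names is_instant and can_compare are made up for this illustration and are not part of any library; the example says nothing about how pyarrow or pandas implement this internally.

```python
from datetime import datetime, timezone

# "Local" (naive) datetime: tzinfo is None, so it does not pin down
# a single point on the global timeline.
local_dt = datetime(2021, 7, 7, 15, 4)

# "Instant" (aware) datetime: tzinfo is set, so it does.
instant_dt = datetime(2021, 7, 7, 15, 4, tzinfo=timezone.utc)


def is_instant(dt: datetime) -> bool:
    """Hypothetical helper: True if dt is timezone-aware."""
    return dt.tzinfo is not None and dt.utcoffset() is not None


def can_compare(a: datetime, b: datetime) -> bool:
    """Hypothetical helper: True if Python allows ordering a against b."""
    try:
        a < b  # ordering naive against aware raises TypeError
        return True
    except TypeError:
        return False


print(is_instant(local_dt))
print(is_instant(instant_dt))
print(can_compare(local_dt, instant_dt))
```

Python itself refuses to order a naive datetime against an aware one, which is the same kind of constraint discussed for the compute kernels further down.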
Second, some of the compute kernels are having time zone aware logic added in ARROW-12980. So, for example, if you read in a column of Unix epochs (as int64) from a Parquet file and you wanted to display them as strings in your local time zone without leaving Arrow C++, you could do something (roughly) like:

Parquet -> INT64 -> Cast(Timestamp, "Insert Local Timezone") -> strftime

Although in that particular case you could argue (and I think I might personally prefer) that "Insert Local Timezone" could instead be an argument passed into strftime. Perhaps Joris & Rok could comment more, as I think they've been working in this area.

I believe there are plans for the compute kernels to also forbid certain operations based on local vs. instant semantics. For example, Cast(Timestamp-UTC -> Timestamp-MST) is OK and Cast(Timestamp-MST -> Timestamp-EST) is OK, but Cast(Timestamp-None -> Timestamp-MST) is NOT OK (although there is a localize kernel if you know that is what you want to do and you accept the risks). Similarly, "instants" can be compared amongst themselves and "local times" can be compared amongst themselves, but "instants" cannot be compared with "local times".

-Weston

On Wed, Jul 7, 2021 at 3:04 PM Evan Chan <e...@urbanlogiq.com> wrote:
>
> Thanks everyone for their input;
>
> Interoperability would be the biggest issue; how much does C++ do with the
> timezone string?
>
> -Evan
>
> > On Jul 7, 2021, at 1:33 PM, Weston Pace <weston.p...@gmail.com> wrote:
> >
> > I don't know about removal but you could probably ignore the timezone
> > string and it's not clear the issues would be that significant.
> >
> > If Rust never produces a non-null non-UTC timestamp then I don't see
> > that as an issue.
> >
> > If you are consuming data with a timestamp string other than UTC it
> > isn't really clear what information that timestamp string is supposed
> > to convey anyways. Are you supposed to extract fields as if you were
> > in that time zone? Or does this indicate the time zone the data was
> > captured in? Postgresql, etc. do not support this concept. Probably
> > the safest thing to do would be to reject the data.
> >
> > There still remains the question of whether or not you need to
> > distinguish between local times and instant times. Or, in python
> > terms, naive vs non-naive. Or, in parquet terms, whether you need to
> > worry about the isAdjustedToUtc flag. Or, in postgres terms, whether
> > you need to distinguish between "timestamp with timezone" and
> > "timestamp without timezone".
> >
> > This boils down to whether you want to support the constraints offered
> > by these semantic hints from the user or not. For example, forbidding
> > comparison between the two types of timestamps or altering how you
> > display them. If those features are not important, then Rust could
> > ignore the time zone field completely. That could cause an
> > interoperability issue though (e.g. data going into rust with timezone
> > UTC comes back out with no timezone even though nothing changed).
> > Ideally rust could ignore the time zone string but leave it unchanged.
> >
> > On Wed, Jul 7, 2021 at 6:58 AM Joris Van den Bossche
> > <jorisvandenboss...@gmail.com> wrote:
> >>
> >> On Wed, 7 Jul 2021 at 18:46, Jorge Cardoso Leitão
> >> <jorgecarlei...@gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> AFAIK timezone is part of the spec.
> >>
> >> And for reference, the current spec (Schema flatbuffer file) for timestamp
> >> is at
> >> https://github.com/apache/arrow/blob/6c8d30ea82222fd2750b999840872d3f6cbdc8f8/format/Schema.fbs#L217-L247.
> >>
> >>> In Python, that would be [1]
> >>>
> >>> import pyarrow as pa
> >>> dt1 = pa.timestamp("ms", "+00:10")
> >>> dt2 = pa.timestamp("ms")
> >>>
> >>> arrow-rs is not very consistent with how it handles it. imo that is an
> >>> artifact of being currently difficult (API wise) to create an array with a
> >>> timezone, which have caused people to not use it much (and thus not
> >>> implement kernels with it / test it properly).
> >>>
> >>> I do not see how removing it would be compatible with the Arrow spec,
> >>> though.
> >>>
> >>> Best,
> >>> Jorge
> >>>
> >>> [1] https://arrow.apache.org/docs/python/generated/pyarrow.timestamp.html
> >>>
> >>> On Wed, Jul 7, 2021 at 6:37 PM Evan Chan <e...@urbanlogiq.com> wrote:
> >>>
> >>>> Hi folks,
> >>>>
> >>>> Some of us are having a discussion about a direction change for Rust Arrow
> >>>> timestamp types, which current support both a resolution field (Ns, Micros,
> >>>> Ms, Seconds) similar to the other language implementations, but also
> >>>> optionally a timezone string field. I believe the timezone field is
> >>>> unique to the Rust implementation, as I don’t find it in the C/C++ and
> >>>> Python docs. At the same time, in reality if the timezone field is non
> >>>> null, this is not well supported at all in the current code. Functions
> >>>> returning timestamps pretty much all return a null timezone, for example,
> >>>> and don’t allow the timezone to be specified.
> >>>>
> >>>> The proposal would be to eliminate the timezone field and bring the Rust
> >>>> Arrow timestamp type in line with that of the other language
> >>>> implementations, also simplifying implementation. It seems this is in
> >>>> line with direction of other projects (Parquet, Spark, and most DBs have
> >>>> timestamp types which do not have explicit timezones or are implicitly UTC).
> >>>>
> >>>> Please feel free to see
> >>>> https://github.com/apache/arrow-datafusion/issues/686
> >>>> (Or would it be better to discuss here in mailing list?)
> >>>>
> >>>> Cheers!
> >>>> Evan