Hi,

> ORC has long had a timestamp format. If extra attributes are needed on a
> timestamp, as long as the default "no metadata" value isn't changed, then at
> the file level things should be OK.
>
> more problematic is: what would happen to an existing app reading in
> timestamps and ignoring any extra attributes. That way lies trouble
Maybe it would be best if the freshly introduced, more explicit types were
not forwards-compatible. To be more precise, it would be enough if only the
"new" semantics were not forwards-compatible; it is fine if older readers
can read the "already existing" semantics, since that is what they expect.
Of course, this more fine-grained control is only possible if there is a
single "already existing" semantics. Whether that's the case or not depends
on the file format as well.

> Talk to the format groups sooner rather than later

Thanks for the suggestion, I will write a small summary from that
perspective soon and contact the file format groups. I have Avro, Parquet
and ORC in mind. Any other file format group I should contact? I plan to
reach out to Arrow and Kudu as well. (Although strictly speaking these are
not file formats, they have their own type systems as well.)

> What does Arrow do in this world, incidentally?

Arrow has a few more options than just UTC-normalized or timezone-agnostic.
It supports arbitrary timezones as well:

/// The time zone is a string indicating the name of a time zone [...]
///
/// * If the time zone is null or equal to an empty string, the data is "time
///   zone naive" and shall be displayed *as is* to the user, not localized
///   to the locale of the user. [...]
///
/// * If the time zone is set to a valid value, values can be displayed as
///   "localized" to that time zone, even though the underlying 64-bit
///   integers are identical to the same data stored in UTC. [...]

https://github.com/apache/arrow/blob/master/format/Schema.fbs#L162
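For illustration, here is a minimal pyarrow sketch of those two flavors (a
sketch of my own, assuming a recent pyarrow release):

import pyarrow as pa

# Timezone-naive timestamp type: the tz field is unset, so values are
# "time zone naive" and displayed as-is (LocalDateTime-style semantics).
naive_type = pa.timestamp("us")

# Timezone-aware timestamp types: the stored 64-bit integers are
# UTC-normalized; the tz string only affects display
# (Instant-style semantics).
utc_type = pa.timestamp("us", tz="UTC")
paris_type = pa.timestamp("us", tz="Europe/Paris")

print(naive_type.tz)   # None -> zone-naive
print(utc_type.tz)     # 'UTC'
print(paris_type.tz)   # 'Europe/Paris'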
Br,

Zoltan

On Wed, Jan 2, 2019 at 5:36 PM Steve Loughran <ste...@hortonworks.com> wrote:
>
> OK, I've seen the document now. Probably the best summary of timestamps out
> there I've ever seen.
>
> Irrespective of what historical stuff has done, the goal should be "make
> everything consistent enough that cut and paste SQL queries over the same
> data works" and "you shouldn't have to care about the persistence format *or
> which app created the data*"
>
> What does Arrow do in this world, incidentally?
>
> On 2 Jan 2019, at 11:48, Steve Loughran <ste...@hortonworks.com> wrote:
>
> On 17 Dec 2018, at 17:44, Zoltan Ivanfi <z...@cloudera.com.INVALID> wrote:
>
> Hi,
>
> On Sun, Dec 16, 2018 at 4:43 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>
> Shall we include Parquet and ORC? If they don't support it, it's hard for
> general query engines like Spark to support it.
>
> For each of the more explicit timestamp types we propose a single
> semantics regardless of the file format. Query engines and other
> applications must explicitly support the new semantics, but it is not
> strictly necessary to extend or modify the file formats themselves,
> since users can declare the desired semantics directly in the end-user
> applications:
>
> - In SQL they would do so by using the more explicit timestamp types
>   as detailed in the proposal. And since the SQL engines in question
>   share the same metastore, users only have to define/update the SQL
>   schema once to achieve interoperability in SQL.
>
> - Other applications will have to add support for the different
>   semantics, but due to the large number of such applications, we
>   cannot coordinate all of that effort. Hopefully though, if we add
>   support in the three major Hadoop SQL engines, other applications
>   will follow suit.
>
> - Spark, specifically, falls into both of the categories mentioned
>   above. It supports SQL queries, where it gets the benefit of the SQL
>   schemas shared via the metastore. It also supports reading data files
>   directly, where the correct timestamp semantics to use would have to
>   be declared programmatically by the user/consumer of the API.
>
> That being said, although not strictly necessary, it is beneficial to
> store the semantics in some file-level metadata as well. This allows
> writers to record the intended semantics of timestamps and readers to
> recognize it, so no input is needed from the user when data is
> ingested from or exported to other tools. It will still require
> explicit support from the applications though. Parquet does have such
> metadata about the timestamp semantics: the isAdjustedToUTC field is
> part of the new parametric timestamp logical type. True means Instant
> semantics, while false means LocalDateTime semantics.
>
> I support the idea of adding similar metadata to other file formats as
> well, but I consider that to be a second step.
>
> ORC has long had a timestamp format. If extra attributes are needed on a
> timestamp, as long as the default "no metadata" value isn't changed, then at
> the file level things should be OK.
>
> more problematic is: what would happen to an existing app reading in
> timestamps and ignoring any extra attributes. That way lies trouble
>
> First I would like to reach an agreement on how different SQL timestamp
> types should behave. (Until we follow this up with that second step,
> file formats with a single non-parametric timestamp type can store
> arbitrary semantics too; users just have to be aware of what timestamp
> semantics were used when they create a SQL table over the data or read
> it in non-SQL applications. Alternatively, we may limit the new types
> to file formats with timestamp semantics metadata and postpone support
> for other file formats until semantics metadata is added to them.)
>
> Talk to the format groups sooner rather than later
>
> Br,
>
> Zoltan
>
> On Wed, Dec 12, 2018 at 3:36 AM Li Jin <ice.xell...@gmail.com> wrote:
>
> Of course. I added some comments in the doc.
>
> On Tue, Dec 11, 2018 at 12:01 PM Imran Rashid <im...@therashids.com> wrote:
>
> Hi Li,
>
> thanks for the comments! I admit I had not thought very much about Python
> support, it's a good point. But I'd actually like to clarify one thing
> about the doc -- though it discusses Java types, the point is actually
> about having support for these logical types at the SQL level. The doc
> uses Java names instead of SQL names just because there is so much
> confusion around the SQL names, as they haven't been implemented
> consistently. Once there is support for the additional logical types,
> then we'd absolutely want to get the same support in Python.
>
> It's great to hear there are existing Python types we can map each
> behavior to. Could you add a comment on the doc on each of the types,
> mentioning the equivalent in Python?
>
> thanks,
> Imran
>
> On Fri, Dec 7, 2018 at 1:33 PM Li Jin <ice.xell...@gmail.com> wrote:
>
> Imran,
>
> Thanks for sharing this. When working on interop between Spark and
> Pandas/Arrow in the past, we also faced some issues due to the different
> definitions of timestamp in Spark and Pandas/Arrow, because Spark
> timestamp has Instant semantics and Pandas/Arrow timestamp has either
> LocalDateTime or OffsetDateTime semantics. (Detailed discussion is in
> the PR: https://github.com/apache/spark/pull/18664#issuecomment-316554156.)
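As an aside, the semantic gap Li describes can be seen directly in pandas
(a minimal sketch of my own, not taken from the PR above):

import pandas as pd

# Zone-naive pandas timestamp: LocalDateTime semantics. It carries no
# time zone, so it does not denote a fixed instant.
local = pd.Timestamp("2019-01-02 12:00:00")

# Zone-aware pandas timestamp: OffsetDateTime/Instant semantics. It is a
# fixed instant that can be converted between zones.
aware = pd.Timestamp("2019-01-02 12:00:00", tz="America/New_York")

print(local.tz)                 # None -> zone-naive
print(aware.tz_convert("UTC"))  # 2019-01-02 17:00:00+00:00

Spark's TIMESTAMP behaves like the aware value: it is UTC-normalized and
rendered in the session time zone.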
> For one I am excited to see this effort going, but I would also love to
> see Python interop included/considered in the picture. I don't think it
> adds much to what has already been proposed, because Python timestamps
> are basically LocalDateTime or OffsetDateTime.
>
> Li
>
> On Thu, Dec 6, 2018 at 11:03 AM Imran Rashid <iras...@cloudera.com.invalid>
> wrote:
>
> Hi,
>
> I'd like to discuss the future of timestamp support in Spark, in
> particular with respect to handling timezones in different SQL types.
> In a nutshell:
>
> * There are at least 3 different ways of handling the timestamp type
>   across timezone changes.
> * We'd like Spark to clearly distinguish the 3 types (it currently
>   implements 1 of them), in a way that is backwards compatible, and also
>   compliant with the SQL standard.
> * We'll get agreement across Spark, Hive, and Impala.
>
> Zoltan Ivanfi (Parquet PMC, also my coworker) has written up a detailed
> doc, describing the problem in more detail, the state of various SQL
> engines, and how we can get to a better state without breaking any
> current use cases. The proposal is good for Spark by itself. We're also
> going to the Hive & Impala communities with this proposal, as it's
> better for everyone if everything is compatible.
>
> Note that this isn't proposing a specific implementation in Spark as
> yet, just a description of the overall problem and our end goal. We're
> going to each community to get agreement on the overall direction. Then
> each community can figure out specifics as they see fit. (I don't think
> there are any technical hurdles with this approach, e.g. whether this
> would even be possible in Spark.)
>
> Here's a link to the doc Zoltan has put together. It is a bit long, but
> it explains how such a seemingly simple concept has become such a mess
> and how we can get to a better state.
>
> https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q/edit#heading=h.dq3b1mwkrfky
>
> Please review the proposal and let us know your opinions, concerns and
> suggestions.
>
> thanks,
> Imran
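To make the three behaviors Imran mentions concrete, the Java-style names
used in the doc map to a plain-Python sketch roughly as follows (an
illustration of mine, not part of the proposal):

from datetime import datetime, timezone, timedelta

# Instant semantics: a fixed point on the UTC timeline; how it is
# rendered depends on the reader's (session) time zone.
instant = datetime(2019, 1, 2, 17, 0, tzinfo=timezone.utc)

# LocalDateTime semantics: bare wall-clock fields with no zone; the value
# reads as "2019-01-02 12:00" everywhere and never shifts.
local = datetime(2019, 1, 2, 12, 0)

# OffsetDateTime semantics: wall-clock fields plus an explicit offset,
# which pins down the instant while preserving the original local time.
offset = datetime(2019, 1, 2, 12, 0, tzinfo=timezone(timedelta(hours=-5)))

# A "timezone change" only affects the rendering of the instant-like values:
print(instant.astimezone(timezone(timedelta(hours=-8))))  # 2019-01-02 09:00:00-08:00
print(local)                                              # 2019-01-02 12:00:00 (unchanged)
print(offset.astimezone(timezone.utc))                    # 2019-01-02 17:00:00+00:00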