Hi,

> ORC has long had a timestamp format. If extra attributes are needed on a
> timestamp, as long as the default "no metadata" value isn't changed, then at
> the file level things should be OK.
>
> more problematic is: what would happen to an existing app reading in
> timestamps and ignoring any extra attributes. That way lies trouble
Maybe it would be best if the freshly introduced, more explicit types were
not forwards-compatible. To be more precise, it would be enough if only the
"new" semantics were not forwards-compatible; it is fine if older readers
can read the "already existing" semantics, since that is what they expect.
Of course, this more fine-grained control is only possible if there is a
single "already existing" semantics. Whether that's the case or not depends
on the file format as well.

> Talk to the format groups sooner rather than later

Thanks for the suggestion, I will write a small summary from that
perspective soon and contact the file format groups. I have Avro, Parquet
and ORC in mind. Any other file format group I should contact? I plan to
reach out to Arrow and Kudu as well. (Although strictly speaking these are
not file formats, they have their own type systems as well.)

> What does Arrow do in this world, incidentally?

Arrow has a few more options than just UTC-normalized or timezone-agnostic.
It supports arbitrary timezones as well:

/// The time zone is a string indicating the name of a time zone [...]
///
/// * If the time zone is null or equal to an empty string, the data is "time
///   zone naive" and shall be displayed *as is* to the user, not localized
///   to the locale of the user. [...]
///
/// * If the time zone is set to a valid value, values can be displayed as
///   "localized" to that time zone, even though the underlying 64-bit
///   integers are identical to the same data stored in UTC. [...]

https://github.com/apache/arrow/blob/master/format/Schema.fbs#L162
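For illustration, here is a minimal pyarrow sketch of those two flavors (a
sketch of my own, assuming a recent pyarrow release):

import pyarrow as pa

# Timezone-naive timestamp type: the tz field is unset, so values are
# "time zone naive" and displayed as-is (LocalDateTime-style semantics).
naive_type = pa.timestamp("us")

# Timezone-aware timestamp types: the stored 64-bit integers are
# UTC-normalized; the tz string only affects display
# (Instant-style semantics).
utc_type = pa.timestamp("us", tz="UTC")
paris_type = pa.timestamp("us", tz="Europe/Paris")

print(naive_type.tz)   # None -> zone-naive
print(utc_type.tz)     # 'UTC'
print(paris_type.tz)   # 'Europe/Paris'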
Br,

Zoltan

On Wed, Jan 2, 2019 at 5:36 PM Steve Loughran <ste...@hortonworks.com> wrote:
>
> OK, I've seen the document now. Probably the best summary of timestamps out
> there I've ever seen.
>
> Irrespective of what historical stuff has done, the goal should be "make
> everything consistent enough that cut and paste SQL queries over the same
> data works" and "you shouldn't have to care about the persistence format *or
> which app created the data*"
>
> What does Arrow do in this world, incidentally?
>
> On 2 Jan 2019, at 11:48, Steve Loughran <ste...@hortonworks.com> wrote:
>
> On 17 Dec 2018, at 17:44, Zoltan Ivanfi <z...@cloudera.com.INVALID> wrote:
>
> Hi,
>
> On Sun, Dec 16, 2018 at 4:43 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>
> Shall we include Parquet and ORC? If they don't support it, it's hard for
> general query engines like Spark to support it.
>
> For each of the more explicit timestamp types we propose a single
> semantics regardless of the file format. Query engines and other
> applications must explicitly support the new semantics, but it is not
> strictly necessary to extend or modify the file formats themselves,
> since users can declare the desired semantics directly in the end-user
> applications:
>
> - In SQL they would do so by using the more explicit timestamp types
>   as detailed in the proposal. And since the SQL engines in question
>   share the same metastore, users only have to define/update the SQL
>   schema once to achieve interoperability in SQL.
>
> - Other applications will have to add support for the different
>   semantics, but due to the large number of such applications, we
>   cannot coordinate all of that effort. Hopefully though, if we add
>   support in the three major Hadoop SQL engines, other applications
>   will follow suit.
>
> - Spark, specifically, falls into both of the categories mentioned
>   above. It supports SQL queries, where it gets the benefit of the SQL
>   schemas shared via the metastore. It also supports reading data files
>   directly, where the correct timestamp semantics to use would have to
>   be declared programmatically by the user/consumer of the API.
>
> That being said, although not strictly necessary, it is beneficial to
> store the semantics in some file-level metadata as well. This allows
> writers to record the intended semantics of timestamps and readers to
> recognize it, so no input is needed from the user when data is
> ingested from or exported to other tools. It will still require
> explicit support from the applications though. Parquet does have such
> metadata about the timestamp semantics: the isAdjustedToUTC field is
> part of the new parametric timestamp logical type. True means Instant
> semantics, while false means LocalDateTime semantics.
>
> I support the idea of adding similar metadata to other file formats as
> well, but I consider that to be a second step.
>
> ORC has long had a timestamp format. If extra attributes are needed on a
> timestamp, as long as the default "no metadata" value isn't changed, then at
> the file level things should be OK.
>
> more problematic is: what would happen to an existing app reading in
> timestamps and ignoring any extra attributes. That way lies trouble
>
> First I would like to reach an agreement on how different SQL timestamp
> types should behave. (Until we follow this up with that second step,
> file formats with a single non-parametric timestamp type can store
> arbitrary semantics too; users just have to be aware of what timestamp
> semantics were used when they create a SQL table over the data or read
> it in non-SQL applications. Alternatively, we may limit the new types
> to file formats with timestamp semantics metadata and postpone support
> for other file formats until semantics metadata is added to them.)
>
> Talk to the format groups sooner rather than later
>
> Br,
>
> Zoltan
>
> On Wed, Dec 12, 2018 at 3:36 AM Li Jin <ice.xell...@gmail.com> wrote:
>
> Of course. I added some comments in the doc.
>
> On Tue, Dec 11, 2018 at 12:01 PM Imran Rashid <im...@therashids.com> wrote:
>
> Hi Li,
>
> thanks for the comments! I admit I had not thought very much about Python
> support, it's a good point. But I'd actually like to clarify one thing
> about the doc -- though it discusses Java types, the point is actually
> about having support for these logical types at the SQL level. The doc
> uses Java names instead of SQL names just because there is so much
> confusion around the SQL names, as they haven't been implemented
> consistently. Once there is support for the additional logical types,
> then we'd absolutely want to get the same support in Python.
>
> It's great to hear there are existing Python types we can map each
> behavior to. Could you add a comment on the doc on each of the types,
> mentioning the equivalent in Python?
>
> thanks,
> Imran
>
> On Fri, Dec 7, 2018 at 1:33 PM Li Jin <ice.xell...@gmail.com> wrote:
>
> Imran,
>
> Thanks for sharing this. When working on interop between Spark and
> Pandas/Arrow in the past, we also faced some issues due to the different
> definitions of timestamp in Spark and Pandas/Arrow, because Spark
> timestamp has Instant semantics and Pandas/Arrow timestamp has either
> LocalDateTime or OffsetDateTime semantics. (Detailed discussion is in
> the PR: https://github.com/apache/spark/pull/18664#issuecomment-316554156.)
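As an aside, the semantic gap Li describes can be seen directly in pandas
(a minimal sketch of my own, not taken from the PR above):

import pandas as pd

# Zone-naive pandas timestamp: LocalDateTime semantics. It carries no
# time zone, so it does not denote a fixed instant.
local = pd.Timestamp("2019-01-02 12:00:00")

# Zone-aware pandas timestamp: OffsetDateTime/Instant semantics. It is a
# fixed instant that can be converted between zones.
aware = pd.Timestamp("2019-01-02 12:00:00", tz="America/New_York")

print(local.tz)                 # None -> zone-naive
print(aware.tz_convert("UTC"))  # 2019-01-02 17:00:00+00:00

Spark's TIMESTAMP behaves like the aware value: it is UTC-normalized and
rendered in the session time zone.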
> For one I am excited to see this effort going, but I would also love to
> see Python interop included/considered in the picture. I don't think it
> adds much to what has already been proposed, because Python timestamps
> are basically LocalDateTime or OffsetDateTime.
>
> Li
>
> On Thu, Dec 6, 2018 at 11:03 AM Imran Rashid <iras...@cloudera.com.invalid>
> wrote:
>
> Hi,
>
> I'd like to discuss the future of timestamp support in Spark, in
> particular with respect to handling timezones in different SQL types.
> In a nutshell:
>
> * There are at least 3 different ways of handling the timestamp type
>   across timezone changes.
> * We'd like Spark to clearly distinguish the 3 types (it currently
>   implements 1 of them), in a way that is backwards compatible, and also
>   compliant with the SQL standard.
> * We'll get agreement across Spark, Hive, and Impala.
>
> Zoltan Ivanfi (Parquet PMC, also my coworker) has written up a detailed
> doc, describing the problem in more detail, the state of various SQL
> engines, and how we can get to a better state without breaking any
> current use cases. The proposal is good for Spark by itself. We're also
> going to the Hive & Impala communities with this proposal, as it's
> better for everyone if everything is compatible.
>
> Note that this isn't proposing a specific implementation in Spark as
> yet, just a description of the overall problem and our end goal. We're
> going to each community to get agreement on the overall direction. Then
> each community can figure out specifics as they see fit. (I don't think
> there are any technical hurdles with this approach, e.g. whether this
> would even be possible in Spark.)
>
> Here's a link to the doc Zoltan has put together. It is a bit long, but
> it explains how such a seemingly simple concept has become such a mess
> and how we can get to a better state.
>
> https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q/edit#heading=h.dq3b1mwkrfky
>
> Please review the proposal and let us know your opinions, concerns and
> suggestions.
>
> thanks,
> Imran
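To make the three behaviors Imran mentions concrete, the Java-style names
used in the doc map to a plain-Python sketch roughly as follows (an
illustration of mine, not part of the proposal):

from datetime import datetime, timezone, timedelta

# Instant semantics: a fixed point on the UTC timeline; how it is
# rendered depends on the reader's (session) time zone.
instant = datetime(2019, 1, 2, 17, 0, tzinfo=timezone.utc)

# LocalDateTime semantics: bare wall-clock fields with no zone; the value
# reads as "2019-01-02 12:00" everywhere and never shifts.
local = datetime(2019, 1, 2, 12, 0)

# OffsetDateTime semantics: wall-clock fields plus an explicit offset,
# which pins down the instant while preserving the original local time.
offset = datetime(2019, 1, 2, 12, 0, tzinfo=timezone(timedelta(hours=-5)))

# A "timezone change" only affects the rendering of the instant-like values:
print(instant.astimezone(timezone(timedelta(hours=-8))))  # 2019-01-02 09:00:00-08:00
print(local)                                              # 2019-01-02 12:00:00 (unchanged)
print(offset.astimezone(timezone.utc))                    # 2019-01-02 17:00:00+00:00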