I don't think it makes a lot of sense to add an INT128 physical type when it could be FLBA(16) instead. New physical types are a larger implementation burden than new logical types.
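
As a rough sketch of the arithmetic behind this (not a proposal for the actual encoding): the snippet below assumes a day is exactly 86400 seconds, uses the Gregorian count of 3,652,425 days per 10,000 years, and assumes a big-endian two's complement layout for the 16 bytes, similar to what DECIMAL uses on FLBA. The helper names are hypothetical and do not come from any Parquet library.

```python
# Illustrative only: all names here are hypothetical, not a Parquet API.
NANOS_PER_DAY = 86_400 * 10**9            # assumes a day is exactly 86400 seconds
DAYS_PER_10K_YEARS = 3_652_425            # 10,000 Gregorian years, in days
TEN_K_YEARS_NANOS = DAYS_PER_10K_YEARS * NANOS_PER_DAY

INT64_MAX = 2**63 - 1
INT128_MAX = 2**127 - 1


def years_covered(max_nanos: int) -> float:
    """Approximate number of Gregorian years representable up to max_nanos."""
    return max_nanos / (NANOS_PER_DAY * 365.2425)


def to_flba16(nanos: int) -> bytes:
    """Encode a signed nanosecond count as 16 bytes (assumed layout:
    big-endian two's complement)."""
    return nanos.to_bytes(16, byteorder="big", signed=True)


def from_flba16(raw: bytes) -> int:
    """Decode 16 bytes produced by to_flba16 back into a signed integer."""
    if len(raw) != 16:
        raise ValueError("expected exactly 16 bytes")
    return int.from_bytes(raw, byteorder="big", signed=True)


# INT64 nanos covers only a few hundred years, so +/- 10K years does not fit.
int64_years = years_covered(INT64_MAX)
fits_in_int64 = TEN_K_YEARS_NANOS <= INT64_MAX
fits_in_16_bytes = TEN_K_YEARS_NANOS <= INT128_MAX

# Round trip of a negative 10K-year interval through 16 bytes.
round_trip = from_flba16(to_flba16(-TEN_K_YEARS_NANOS))

# Plain unsigned bytewise comparison misorders negative values, so min-max
# statistics would need a signed sort order defined by the logical type.
bytewise_misorders = to_flba16(-1) > to_flba16(1)
```

With this layout the full range fits comfortably in 16 bytes, but the ordering used for statistics would have to be specified by the logical type rather than inherited from the default FLBA byte comparison.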

Regards

Antoine.


On Mon, 7 Jul 2025 10:24:50 +0300
Alkis Evlogimenos
<alkis.evlogime...@databricks.com.INVALID> wrote:
> In the 25th June meeting the predominant suggestion was to make INTERVAL a
> logical type with physical type being INT64 nanos. This does not cover the
> full 10k year ANSI SQL range and the rebuttal for that is that we could add
> an INT128 physical type that could also be used for that and other things.
>
> Why wouldn't the above work?
>
> On Mon, Jul 7, 2025 at 9:37 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > I think I've lost the thread a little bit on this discussion. I'd like to
> > try to summarize my understanding of the issues at hand and current
> > proposals.
> >
> > As a short summary, my understanding of the requirements we want for a new
> > parquet day-time interval type are:
> >
> > 1. Represent +/- 10K years for at least nanosecond granularity (IIUC the
> >    range is specified by ANSI SQL. The granularity is picked for
> >    convenience that covers most use cases).
> > 2. Allow pruning via min-max statistics
> >
> > Shortcomings of the existing interval type:
> > 1. Millisecond granularity.
> > 2. Only supports positive values (all integers in the FLBA of 12 bytes
> >    are unsigned).
> > 3. No sort order
> >
> > Proposals:
> > 1. Fixed width integer duration type (assumes days are always exactly
> >    86400 seconds)
> >    Pros:
> >    - Easy to model.
> >    - Maps easily to ANSI SQL Day Time Interval representations that
> >      several engines use.
> >    Cons:
> >    - Doesn't match well with any Arrow type.
> >    - Doesn't capture semantics for engines that treat day as a calendar
> >      type.
> >
> > 2. IIUC, Something based off of Arrow's MonthDayNanos type. I apologize,
> >    I'm not sure if this is what is being proposed but here are the
> >    approaches I see with this:
> >    - FLBA 16 byte type exactly like arrow
> >    - Shredded (month, day and nanos each have a separate field under a
> >      struct). The main complication with this would be statistics since
> >      it is all on the leaf fields, so there is either some information
> >      loss OR a convention for recording statistics on one of the leaves
> >      needs to be followed.
> >
> >    - Pros:
> >      - Potentially more efficient storage
> >      - Maps directly to one of Arrow's Interval types.
> >    - Cons:
> >      - Comparison in general is challenging. Perhaps this can be solved
> >        by a new sort order that clarifies relationships between fields
> >        or additional metadata on the logical type, e.g. UNORDERED (Arrow
> >        semantics) or YEAR-MONTH (only month field is expected to be
> >        populated and values are ordered on it) or DAY-TIME ordered (a
> >        day means 86400 seconds and nanoseconds normalized to be less
> >        than 24 hours).
> >      - Doesn't conform to the notion of ANSI SQL interval types (which
> >        have two separate representations for Year/Month and Day Time).
> >        So some parquet files might not be readable by ANSI SQL.
> >      - For the likely uncommon case (intervals ~+/- 200 years) might
> >        require more transformation to an engine's internal
> >        representation.
> >      - Only works for the SQL standard if days are assumed to be 86400
> >        seconds (right now fields are specified as independent of each
> >        other).
> >
> >
> > Did I get the requirements right? Are there some other options people were
> > thinking about?
> >
> > Thanks,
> > Micah
> >
> >
> >
> > On Fri, Jul 4, 2025 at 6:49 AM Antoine Pitrou
> > <antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
> >
> > > On Thu, 3 Jul 2025 17:22:27 +0100
> > > Raphael Taylor-Davies
> > > <r.taylordav...@googlemail.com.INVALID>
> > > wrote:
> > > > Hi,
> > > >
> > > > I may be misunderstanding something, but it appears that the motivation
> > > > for this effort is that the current interval type represents a superset
> > > > of the functionality required by ANSI SQL, and therefore can cause
> > > > compatibility problems for some databases that only support the minimum
> > > > required by ANSI SQL?
> > >
> > > While I understand the desire to be able to represent all values
> > > allowable in ANSI SQL, I really don't understand why our types should
> > > not be allowed to represent any values *outside* of the range allowed
> > > in ANSI SQL.
> > >
> > > Please let's be mindful that Parquet is not useful only for SQL-type
> > > workloads. Besides, ANSI SQL itself might evolve and we don't want to
> > > add another Interval type in a few years because the one we're
> > > currently specifying ends up too tight.
> > >
> > > (if some people want to make sure that values don't fall outside the
> > > ANSI SQL range, they can write a validation pass for it; no need to
> > > burden the Parquet *format* with such constraints)
> > >
> > > Regards
> > >
> > > Antoine.